Introduction to Open Data Science - Course Project

About the project

My git repository

date()
## [1] "Mon Dec 11 19:16:54 2023"

Heads-up: I am new to R. During this first week, I studied the first four chapters of the course book (built with RStudio) and watched the given tutorial videos.

I have learned how to:
- set my directory
- use the R project mode, the R markdown file and github commit
- install and use libraries
- identify and use the different data types including the factors
- use the pipe %>% (it means “and then…”)
- open a data set, read it and inspect its structure
- select rows and columns, match text and rename columns
- filter data based on variable values
- create a table / tibble
- create an object that is reusable later
- work with date and time formats
- add columns with text, int, calculations, categories etc.
- group data with one or many variables to then perform calculation
- combine all data types to create a string with the function paste()
- check the NA and remove the NA to do calculations
- summarise data to sum up values per variable
- reshape a table from wide to long and from long to wide
- combine two tables (functions: bind_rows, inner_join, full_join, left_join, right_join)
- use the most common arithmetic functions and percentages
- rank and reorder values
- create plots (bar, col, line, point, box, histogram)
- play with the different style and aesthetic possibilities
- add labels to plots
- combine plots together
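As a small recap of the reshaping, joining and grouping bullets above, here is a minimal sketch on an invented toy tibble (the names `scores_wide` and `info` are my own, not from the course):

```r
library(tidyverse)

# Invented toy data for illustration
scores_wide <- tibble(
  student = c("A", "B"),
  math    = c(15, 20),
  biology = c(18, 12)
)
info <- tibble(student = c("A", "B"), group = c("morning", "evening"))

# wide -> long, join the second table, then summarise per group
scores_long <- scores_wide %>%
  pivot_longer(cols = -student, names_to = "subject", values_to = "points") %>%
  left_join(info, by = "student")

scores_long %>%
  group_by(group) %>%
  summarise(mean_points = mean(points))
```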

With this tool I discovered how easy and fast it can be to manipulate and visualize data from a big data set.

Conclusion of Week 1:
The book is really well made. The style is easy to follow and understand. We can learn at our own pace and try the examples in RStudio to assimilate the concepts well.
I am new to R and I find it hard to learn so many new concepts in one week. I would have liked more exercises, without the solutions written directly underneath, and with a tool to check our lines of code. Finally, I understand it is not necessary to know every function by heart, but I would like to understand the capabilities of this tool well by practicing and doing more real-life exercises.

Lines of code to summarize learning in week 1:

# Get the data set
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")
## Rows: 168 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cause, sex, income
## dbl (2): year, deaths_millions
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Code I wrote to compute the percentage of deaths per cause for each year.
exercise4 <- gbd_full %>% 
  group_by(year, cause) %>% 
  summarise(total_per_cause= sum(deaths_millions)) %>% 
  group_by(year) %>% 
  mutate(total_per_year =sum(total_per_cause)) %>% 
  mutate(percentage = percent(total_per_cause/total_per_year)) %>% 
  select(year,cause,percentage) %>% 
  pivot_wider(names_from = cause, values_from = percentage)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
exercise4
## # A tibble: 7 × 4
## # Groups:   year [7]
##    year `Communicable diseases` Injuries `Non-communicable diseases`
##   <dbl> <chr>                   <chr>    <chr>                      
## 1  1990 33%                     9%       58%                        
## 2  1995 31%                     9%       60%                        
## 3  2000 29%                     9%       62%                        
## 4  2005 27%                     9%       64%                        
## 5  2010 24%                     9%       67%                        
## 6  2015 20%                     8%       72%                        
## 7  2017 19%                     8%       73%
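To see the same shares as a trend, the percentages can be kept in long format and plotted directly (a sketch, assuming the `gbd_full` tibble loaded above):

```r
library(tidyverse)
library(scales)

# Same shares as in exercise4, but left in long format for plotting
gbd_full %>%
  group_by(year, cause) %>%
  summarise(total_per_cause = sum(deaths_millions), .groups = "drop_last") %>%
  mutate(share = total_per_cause / sum(total_per_cause)) %>%
  ggplot(aes(x = year, y = share, colour = cause)) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  labs(y = "Share of all deaths", title = "Share of deaths per cause, 1990-2017")
```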
# Code I wrote to compute the number of deaths per sex and income, and the Male/Female ratio for each income group.
gbd_full %>% 
  filter(year ==1990) %>% 
  group_by(sex,income) %>% 
  summarize(deaths_per_sex_income = sum(deaths_millions)) %>% 
  pivot_wider(names_from = sex, values_from = deaths_per_sex_income) %>% 
  mutate(diff_M_F = Male/Female)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
##   income       Female  Male diff_M_F
##   <chr>         <dbl> <dbl>    <dbl>
## 1 High           4.14  4.46     1.08
## 2 Low            2.22  2.57     1.16
## 3 Lower-Middle   8.47  9.83     1.16
## 4 Upper-Middle   6.68  7.95     1.19

Chapter 2: Regression and model validation

date()
## [1] "Mon Dec 11 19:16:56 2023"
# set directory
setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

Thoughts about this week 2:

After reading all of chapters 1-7, I am now more confident using RStudio. I also understand the language better and I can search the web to use new functions that I did not know before.

It is very exciting to see how efficient this tool is and to think about all the analyses we can do. I am an open university student and I can already see how to use this tool at work :).

Exercise: WRANGLING. Please find the script file on GitHub: create_learning2014_week2

Exercise: DATA ANALYSIS

QUESTION 1:

# Using the table from the course.
csv_table_read <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep = ",", header = T)

# library
library("tidyverse")
library("finalfit")
library("broom")


csv_table_read
##     gender age attitude     deep  stra     surf points
## 1        F  53      3.7 3.583333 3.375 2.583333     25
## 2        M  55      3.1 2.916667 2.750 3.166667     12
## 3        F  49      2.5 3.500000 3.625 2.250000     24
## 4        M  53      3.5 3.500000 3.125 2.250000     10
## 5        M  49      3.7 3.666667 3.625 2.833333     22
## 6        F  38      3.8 4.750000 3.625 2.416667     21
## 7        M  50      3.5 3.833333 2.250 1.916667     21
## 8        F  37      2.9 3.250000 4.000 2.833333     31
## 9        M  37      3.8 4.333333 4.250 2.166667     24
## 10       F  42      2.1 4.000000 3.500 3.000000     26
## 11       M  37      3.9 3.583333 3.625 2.666667     31
## 12       F  34      3.8 3.833333 4.750 2.416667     31
## 13       F  34      2.4 4.250000 3.625 2.250000     23
## 14       F  34      3.0 3.333333 3.500 2.750000     25
## 15       M  35      2.6 4.166667 1.750 2.333333     21
## 16       F  33      4.1 3.666667 3.875 2.333333     31
## 17       F  32      2.6 4.083333 1.375 2.916667     20
## 18       F  44      2.6 3.500000 3.250 2.500000     22
## 19       M  29      1.7 4.083333 3.000 3.750000      9
## 20       F  30      2.7 4.000000 3.750 2.750000     24
## 21       M  27      3.9 3.916667 2.625 2.333333     28
## 22       M  29      3.4 4.000000 2.375 2.416667     30
## 23       F  31      2.7 4.000000 3.625 3.000000     24
## 24       F  37      2.3 3.666667 2.750 2.416667      9
## 25       F  26      3.7 3.666667 1.750 2.833333     26
## 26       F  26      4.4 4.416667 3.250 3.166667     32
## 27       M  30      4.1 3.916667 4.000 3.000000     32
## 28       F  33      3.7 3.750000 3.625 2.000000     33
## 29       F  33      2.5 3.250000 2.875 3.500000     29
## 30       M  28      3.0 3.583333 3.000 3.750000     30
## 31       M  26      3.4 4.916667 1.625 2.500000     19
## 32       F  27      3.2 3.583333 3.250 2.083333     23
## 33       F  25      2.0 2.916667 3.500 2.416667     19
## 34       F  31      2.4 3.666667 3.000 2.583333     12
## 35       M  20      4.2 4.500000 3.250 1.583333     10
## 36       F  39      1.6 4.083333 1.875 2.833333     11
## 37       M  38      3.1 3.833333 4.375 1.833333     20
## 38       M  24      3.8 3.250000 3.625 2.416667     26
## 39       M  26      3.8 2.333333 2.500 3.250000     31
## 40       M  25      3.3 3.333333 1.250 3.416667     20
## 41       F  30      1.7 4.083333 4.000 3.416667     23
## 42       F  25      2.5 2.916667 3.000 3.166667     12
## 43       M  30      3.2 3.333333 2.500 3.500000     24
## 44       F  48      3.5 3.833333 4.875 2.666667     17
## 45       F  24      3.2 3.666667 5.000 2.416667     29
## 46       F  40      4.2 4.666667 4.375 3.583333     23
## 47       M  25      3.1 3.750000 3.250 2.083333     28
## 48       F  23      3.9 3.416667 4.000 3.750000     31
## 49       F  25      1.9 4.166667 3.125 2.916667     23
## 50       F  23      2.1 2.916667 2.500 2.916667     25
## 51       M  27      2.5 4.166667 3.125 2.416667     18
## 52       M  25      3.2 3.583333 3.250 3.000000     19
## 53       M  23      3.2 2.833333 2.125 3.416667     22
## 54       F  23      2.6 4.000000 2.750 2.916667     25
## 55       F  23      2.3 2.916667 2.375 3.250000     21
## 56       F  45      3.8 3.000000 3.125 3.250000      9
## 57       F  22      2.8 4.083333 4.000 2.333333     28
## 58       F  23      3.3 2.916667 4.000 3.250000     25
## 59       M  21      4.8 3.500000 2.250 2.500000     29
## 60       M  21      4.0 4.333333 3.250 1.750000     33
## 61       F  21      4.0 4.250000 3.625 2.250000     33
## 62       F  21      4.7 3.416667 3.625 2.083333     25
## 63       F  26      2.3 3.083333 2.500 2.833333     18
## 64       F  25      3.1 4.583333 1.875 2.833333     22
## 65       F  26      2.7 3.416667 2.000 2.416667     17
## 66       M  21      4.1 3.416667 1.875 2.250000     25
## 67       F  23      3.4 3.416667 4.000 2.833333     28
## 68       F  22      2.5 3.583333 2.875 2.250000     22
## 69       F  22      2.1 1.583333 3.875 1.833333     26
## 70       F  22      1.4 3.333333 2.500 2.916667     11
## 71       F  23      1.9 4.333333 2.750 2.916667     29
## 72       M  22      3.7 4.416667 4.500 2.083333     22
## 73       M  23      3.2 4.833333 3.375 2.333333     21
## 74       M  24      2.8 3.083333 2.625 2.416667     28
## 75       F  22      4.1 3.000000 4.125 2.750000     33
## 76       F  23      2.5 4.083333 2.625 3.250000     16
## 77       M  22      2.8 4.083333 2.250 1.750000     31
## 78       M  20      3.8 3.750000 2.750 2.583333     22
## 79       M  22      3.1 3.083333 3.000 3.333333     31
## 80       M  21      3.5 4.750000 1.625 2.833333     23
## 81       F  22      3.6 4.250000 1.875 2.500000     26
## 82       F  23      2.6 4.166667 3.375 2.416667     12
## 83       M  21      4.4 4.416667 3.750 2.416667     26
## 84       M  22      4.5 3.833333 2.125 2.583333     31
## 85       M  29      3.2 3.333333 2.375 3.000000     19
## 86       F  29      3.9 3.166667 2.750 2.000000     30
## 87       F  21      2.5 3.166667 3.125 3.416667     12
## 88       M  28      3.3 3.833333 3.500 2.833333     17
## 89       F  21      3.3 4.250000 2.625 2.250000     18
## 90       F  30      3.0 3.833333 3.375 2.750000     19
## 91       F  21      2.9 3.666667 2.250 3.916667     21
## 92       M  23      3.3 3.833333 3.000 2.333333     24
## 93       F  21      3.3 3.833333 4.000 2.750000     28
## 94       F  21      3.5 3.833333 3.500 2.750000     17
## 95       F  20      3.6 3.666667 2.625 2.916667     18
## 96       M  22      3.7 4.333333 2.500 2.083333     17
## 97       M  21      4.2 3.750000 3.750 3.666667     23
## 98       M  21      3.2 4.166667 3.625 2.833333     26
## 99       F  20      5.0 4.000000 4.125 3.416667     28
## 100      M  22      4.7 4.000000 4.375 1.583333     31
## 101      F  20      3.6 4.583333 2.625 2.916667     27
## 102      F  20      3.6 3.666667 4.000 3.000000     25
## 103      M  24      2.9 3.666667 2.750 2.916667     23
## 104      F  20      3.5 3.833333 2.750 2.666667     21
## 105      F  19      4.0 2.583333 1.375 3.000000     27
## 106      F  21      3.5 3.500000 2.250 2.750000     28
## 107      F  21      3.2 3.083333 3.625 3.083333     23
## 108      F  22      2.6 4.250000 3.750 2.500000     21
## 109      F  25      2.0 3.166667 4.000 2.333333     25
## 110      F  21      2.7 3.083333 3.125 3.000000     11
## 111      F  22      3.2 4.166667 3.250 3.000000     19
## 112      F  25      3.3 2.250000 2.125 4.000000     24
## 113      F  20      3.9 3.333333 2.875 3.250000     28
## 114      M  24      3.3 3.083333 1.500 3.500000     21
## 115      F  20      3.0 2.750000 2.500 3.500000     24
## 116      M  21      3.7 3.250000 3.250 3.833333     24
## 117      F  20      2.5 4.000000 3.625 2.916667     20
## 118      F  20      2.9 3.583333 3.875 2.166667     19
## 119      M  31      3.9 4.083333 3.875 1.666667     30
## 120      F  20      3.6 4.250000 2.375 2.083333     22
## 121      F  22      2.9 3.416667 3.000 2.833333     16
## 122      F  22      2.1 3.083333 3.375 3.416667     16
## 123      M  21      3.1 3.500000 2.750 3.333333     19
## 124      M  22      4.0 3.666667 4.500 2.583333     30
## 125      F  21      3.1 4.250000 2.625 2.833333     23
## 126      F  21      2.3 4.250000 2.750 3.333333     19
## 127      F  21      2.8 3.833333 3.250 3.000000     18
## 128      F  21      3.7 4.416667 4.125 2.583333     28
## 129      F  20      2.6 3.500000 3.375 2.416667     21
## 130      F  21      2.4 3.583333 2.750 3.583333     19
## 131      F  25      3.0 3.666667 4.125 2.083333     27
## 132      M  21      2.8 2.083333 3.250 4.333333     24
## 133      F  24      2.9 4.250000 2.875 2.666667     21
## 134      F  20      2.4 3.583333 2.875 3.000000     20
## 135      M  21      3.1 4.000000 2.375 2.666667     28
## 136      F  20      1.9 3.333333 3.875 2.166667     12
## 137      F  20      2.0 3.500000 2.125 2.666667     21
## 138      F  18      3.8 3.166667 4.000 2.250000     28
## 139      F  21      3.4 3.583333 3.250 2.666667     31
## 140      F  19      3.7 3.416667 2.625 3.333333     18
## 141      F  21      2.9 4.250000 2.750 3.500000     25
## 142      F  20      2.3 3.250000 4.000 2.750000     19
## 143      M  21      4.1 4.416667 3.000 2.000000     21
## 144      F  20      2.7 3.250000 3.375 2.833333     16
## 145      F  21      3.5 3.916667 3.875 3.500000      7
## 146      F  20      3.4 3.583333 3.250 2.500000     21
## 147      F  18      3.2 4.500000 3.375 3.166667     17
## 148      M  22      3.3 3.583333 4.125 3.083333     22
## 149      F  22      3.3 3.666667 3.500 2.916667     18
## 150      M  24      3.5 2.583333 2.000 3.166667     25
## 151      F  19      3.2 4.166667 3.625 2.500000     24
## 152      F  20      3.1 3.250000 3.375 3.833333     23
## 153      F  20      2.8 4.333333 2.125 2.250000     23
## 154      F  17      1.7 3.916667 4.625 3.416667     26
## 155      M  19      1.9 2.666667 2.500 3.750000     12
## 156      F  20      3.5 3.083333 2.875 3.000000     32
## 157      F  20      2.4 3.750000 2.750 2.583333     22
## 158      F  20      2.1 4.166667 4.000 3.333333     20
## 159      F  20      2.9 4.166667 2.375 2.833333     21
## 160      F  19      1.9 3.250000 3.875 3.000000     23
## 161      F  19      2.0 4.083333 3.375 2.833333     20
## 162      F  22      4.2 2.916667 1.750 3.166667     28
## 163      M  35      4.1 3.833333 3.000 2.750000     31
## 164      F  18      3.7 3.166667 2.625 3.416667     18
## 165      F  19      3.6 3.416667 2.625 3.000000     30
## 166      M  21      1.8 4.083333 3.375 2.666667     19
# analyze the structure of the dataset
str(csv_table_read)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# analyze the dimension of the dataset
dim(csv_table_read)
## [1] 166   7
# Missing data? No data is missing.
ff_glimpse(csv_table_read)
## $Continuous
##             label var_type   n missing_n missing_percent mean  sd  min
## age           age    <int> 166         0             0.0 25.5 7.8 17.0
## attitude attitude    <dbl> 166         0             0.0  3.1 0.7  1.4
## deep         deep    <dbl> 166         0             0.0  3.7 0.6  1.6
## stra         stra    <dbl> 166         0             0.0  3.1 0.8  1.2
## surf         surf    <dbl> 166         0             0.0  2.8 0.5  1.6
## points     points    <int> 166         0             0.0 22.7 5.9  7.0
##          quartile_25 median quartile_75  max
## age             21.0   22.0        27.0 55.0
## attitude         2.6    3.2         3.7  5.0
## deep             3.3    3.7         4.1  4.9
## stra             2.6    3.2         3.6  5.0
## surf             2.4    2.8         3.2  4.3
## points          19.0   23.0        27.8 33.0
## 
## $Categorical
##         label var_type   n missing_n missing_percent levels_n levels
## gender gender    <chr> 166         0             0.0        2      -
##        levels_count levels_percent
## gender            -              -
# summary statistics for each variable
missing_glimpse(csv_table_read)
##             label var_type   n missing_n missing_percent
## gender     gender    <chr> 166         0             0.0
## age           age    <int> 166         0             0.0
## attitude attitude    <dbl> 166         0             0.0
## deep         deep    <dbl> 166         0             0.0
## stra         stra    <dbl> 166         0             0.0
## surf         surf    <dbl> 166         0             0.0
## points     points    <int> 166         0             0.0
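As a quick cross-check of the finalfit output, base R can count missing values without extra packages (a sketch using `csv_table_read` from above):

```r
# Base-R alternative for the missing-data check
colSums(is.na(csv_table_read))   # NAs per column
sum(is.na(csv_table_read))       # total NAs; 0 means nothing is missing
```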
# Count per gender and percentage male / female
library("scales")
csv_table_read %>% 
  count(gender) %>% 
  mutate(total_percentage = n / nrow(csv_table_read)) %>% 
  mutate(total_percentage2 = percent(total_percentage))
##   gender   n total_percentage total_percentage2
## 1      F 110        0.6626506               66%
## 2      M  56        0.3373494               34%
# Summary statistics (min, quartiles, median, mean, max) for each variable
summary(csv_table_read)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# Age varies from 17 to 55; the mean is 25.5 and the median 22, which suggests some relatively high values in the dataset
# Attitude varies from 1.4 to 5
# Points range from 7 to 33; the mean is 22.7 and the median 23, which suggests some relatively low values in the dataset
# next, we analyze the variables for both genders together and then separately per gender

# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(csv_table_read[-1])

# access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(csv_table_read, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

# the plot matrix suggests possible correlations between some of the variables

# Relationship between points and attitudes
csv_table_read %>% 
  ggplot(aes(x= attitude, y= points)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

QUESTION 2, 3 and 4:

Female model

#female model
female_data <- csv_table_read %>% 
  filter(gender == "F")

View(female_data)


# Fit a multiple linear model for females. Let's check how points are influenced by age, attitude and deep learning approach
female_fitmodel <- lm(points ~ age + attitude + deep, data = female_data)
# In this model I want to check whether age, attitude and deep each affect points, while controlling for the others.

# summary of estimates, standard errors and p values
summary(female_fitmodel) 
## 
## Call:
## lm(formula = points ~ age + attitude + deep, data = female_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.058  -3.263   0.622   4.003  10.533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.59029    4.28982   3.168  0.00201 ** 
## age         -0.01743    0.06983  -0.250  0.80338    
## attitude     3.40151    0.70837   4.802 5.19e-06 ***
## deep        -0.27355    0.96270  -0.284  0.77685    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.353 on 106 degrees of freedom
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1559 
## F-statistic:  7.71 on 3 and 106 DF,  p-value: 0.0001043
summary(female_fitmodel)$r.squared
## [1] 0.1791183
#"age," "attitude," and "deep" together explain about 18% of the variation in "points"

# p value of the intercept: it is significant as it is very small (0.002), so the intercept plays a role in the regression model
# the baseline of the model is 13.59 (estimate), when no other factors are taken into account
# age is not significant and is not correlated with points
# deep is not significant and is not correlated with points
# attitude is significant and seems to play a significant role on the points:
# for a one-point increase in attitude, the points increase by 3.40 (estimate)
# the relatively high standard errors show that the estimates are not very precise; this could be due to the sample size.
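To make the precision comment concrete, the 95% confidence intervals of the coefficients can be inspected (a sketch using the model fitted above; `broom` is loaded earlier in this chapter):

```r
# Wide intervals for age and deep are another way to see the imprecise estimates
confint(female_fitmodel)

# equivalently, with broom:
tidy(female_fitmodel, conf.int = TRUE)
```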

# I decide to drop the deep and the age variables and keep only the attitude.
female_fitmodel2 <- lm(points ~ attitude, data = female_data)
summary(female_fitmodel2) 
## 
## Call:
## lm(formula = points ~ attitude, data = female_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0557  -3.3486   0.6137   3.9819  10.3668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.194      2.156   5.655 1.29e-07 ***
## attitude       3.389      0.701   4.835 4.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 108 degrees of freedom
## Multiple R-squared:  0.1779, Adjusted R-squared:  0.1703 
## F-statistic: 23.38 on 1 and 108 DF,  p-value: 4.442e-06
tidy(female_fitmodel2)
## # A tibble: 2 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)    12.2      2.16       5.66 0.000000129
## 2 attitude        3.39     0.701      4.83 0.00000444
summary(female_fitmodel2)$r.squared
## [1] 0.1779266
# the p value is very low and the standard error small, so this model supports the positive relation: a more positive attitude -> more points.
# the R-squared is still quite low..
# The model doesn't provide a good fit for the data, and a significant portion of the variance is not explained. It could be due to the sample size.

# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points

# autoplot() doesn't knit here (it requires the ggfortify package).
#autoplot(female_fitmodel)
#autoplot(female_fitmodel2)

# I use the plot() function with the `which` argument to get the desired plots.
plot(female_fitmodel,which = c(1,2,5))

plot(female_fitmodel2,which = c(1,2,5))

# we observe non-normality at both ends of the line in the QQ plot
# both models show some high-leverage points, indicated in the Residuals vs Leverage plot
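A numeric counterpart to the Residuals vs Leverage plot is Cook's distance, which scores how much each observation alone shifts the fitted coefficients (a sketch using the second female model; the 4/n cutoff is only a common rule of thumb, not from the course):

```r
# Cook's distance per observation for the attitude-only female model
cooks <- cooks.distance(female_fitmodel2)
head(sort(cooks, decreasing = TRUE))   # largest values = most influential rows
which(cooks > 4 / length(cooks))       # rows above a common rule-of-thumb cutoff
```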

Male model

# male model
male_data <- csv_table_read %>% 
  filter(gender == "M")

View(male_data)
summary(male_data)
##     gender               age          attitude          deep      
##  Length:56          Min.   :19.0   Min.   :1.700   Min.   :2.083  
##  Class :character   1st Qu.:21.0   1st Qu.:3.100   1st Qu.:3.396  
##  Mode  :character   Median :24.0   Median :3.400   Median :3.792  
##                     Mean   :26.8   Mean   :3.443   Mean   :3.725  
##                     3rd Qu.:29.0   3rd Qu.:3.900   3rd Qu.:4.083  
##                     Max.   :55.0   Max.   :4.800   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 9.00  
##  1st Qu.:2.375   1st Qu.:2.312   1st Qu.:20.00  
##  Median :3.000   Median :2.625   Median :23.50  
##  Mean   :2.964   Mean   :2.704   Mean   :23.48  
##  3rd Qu.:3.531   3rd Qu.:3.167   3rd Qu.:28.25  
##  Max.   :4.500   Max.   :4.333   Max.   :33.00
# Fit a multiple linear model for males. Let's check how points are influenced by age, attitude and deep learning approach
male_fitmodel <- lm(points ~ age + attitude + deep, data = male_data)

# summary of estimates, standard errors and p values
summary(male_fitmodel) 
## 
## Call:
## lm(formula = points ~ age + attitude + deep, data = male_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.8084  -3.3162  -0.0696   3.2195   9.9927 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.21897    6.06857   3.002 0.004112 ** 
## age         -0.16602    0.08456  -1.963 0.054974 .  
## attitude     4.31829    1.11699   3.866 0.000309 ***
## deep        -1.38378    1.22006  -1.134 0.261916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.271 on 52 degrees of freedom
## Multiple R-squared:  0.2718, Adjusted R-squared:  0.2298 
## F-statistic:  6.47 on 3 and 52 DF,  p-value: 0.000835
tidy(male_fitmodel)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   18.2      6.07        3.00 0.00411 
## 2 age           -0.166    0.0846     -1.96 0.0550  
## 3 attitude       4.32     1.12        3.87 0.000309
## 4 deep          -1.38     1.22       -1.13 0.262
summary(male_fitmodel)$r.squared 
## [1] 0.2718164
# similar results to the female model. 
# All variables have smaller p values than in the female model. 
# the R-squared is higher, as the model explains 27%, but it is still quite low. It could be due to the sample size.

# I decide to drop the deep and the age as variables and keep only the attitude.
male_fitmodel2 <- lm(points ~ attitude, data = male_data)
summary(male_fitmodel2) 
## 
## Call:
## lm(formula = points ~ attitude, data = male_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6535  -2.9073  -0.5121   3.6974  10.2106 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.061      3.953   2.292  0.02581 *  
## attitude       4.189      1.129   3.711  0.00049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.411 on 54 degrees of freedom
## Multiple R-squared:  0.2032, Adjusted R-squared:  0.1884 
## F-statistic: 13.77 on 1 and 54 DF,  p-value: 0.0004897
tidy(male_fitmodel2)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     9.06      3.95      2.29 0.0258  
## 2 attitude        4.19      1.13      3.71 0.000490
# note: attitude alone explains about 20% of the variation in points
# (Multiple R-squared: 0.2032 in the summary above), slightly less than the 27% of the three-variable model
# the p value is very low and the standard error small, so this model supports the positive relation: a more positive attitude -> more points.
# The model still doesn't provide a good fit for the data, and a significant portion of the variance is not explained. It could be due to the sample size.

# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points

# autoplot() doesn't knit here (it requires the ggfortify package)
# autoplot(male_fitmodel)
# autoplot(male_fitmodel2) 

# plot with the plot function
plot(male_fitmodel,which = c(1,2,5))

plot(male_fitmodel2,which = c(1,2,5))

# The red line in Residuals vs Fitted stays quite close to the 0 line, which is good
# both models show some non-normality, observed at the lower end of the QQ plot
# both models show some high-leverage points, indicated in the Residuals vs Leverage plot

Other models tested:

test_fit1 <- csv_table_read %>% 
  lm(points ~ deep, data = .)
library(ggfortify)
summary(test_fit1)
## 
## Call:
## lm(formula = points ~ deep, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6913  -3.6935   0.2862   4.9957  10.3537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1141     3.0908   7.478 4.31e-12 ***
## deep         -0.1080     0.8306  -0.130    0.897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared:  0.000103,   Adjusted R-squared:  -0.005994 
## F-statistic: 0.01689 on 1 and 164 DF,  p-value: 0.8967
tidy(test_fit1) # the p value for deep is large (0.897): deep alone is not significant
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   23.1       3.09      7.48  4.31e-12
## 2 deep          -0.108     0.831    -0.130 8.97e- 1
summary(test_fit1)$r.squared  # essentially zero: deep explains almost nothing
## [1] 0.0001029919
test_fit2 <- csv_table_read %>% 
  lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit2)
## 
## Call:
## lm(formula = points ~ deep * gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3247  -3.3338   0.3369   4.6242  10.6787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.36387    3.91828   5.708 5.33e-08 ***
## deep         -0.01001    1.06032  -0.009    0.992    
## genderM       2.67476    6.41719   0.417    0.677    
## deep:genderM -0.40787    1.71487  -0.238    0.812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared:  0.00922,    Adjusted R-squared:  -0.009127 
## F-statistic: 0.5025 on 3 and 162 DF,  p-value: 0.6811
tidy(test_fit2) # all p values except the intercept are large: not significant
## # A tibble: 4 × 5
##   term         estimate std.error statistic      p.value
##   <chr>           <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   22.4         3.92   5.71    0.0000000533
## 2 deep          -0.0100      1.06  -0.00944 0.992       
## 3 genderM        2.67        6.42   0.417   0.677       
## 4 deep:genderM  -0.408       1.71  -0.238   0.812
summary(test_fit2)$r.squared # still far too low
## [1] 0.009220341
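A compact way to compare these candidate models is AIC, where a lower value suggests a better trade-off between fit and complexity (a sketch; both models were fitted on the same 166 rows):

```r
# Compare the two test models; the row with the lower AIC is preferred
AIC(test_fit1, test_fit2)
```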

Basic data

# Female vs Male participants
csv_table_read %>% 
  ggplot(aes(x=gender)) +
  geom_bar()

# age chart and gender per age
csv_table_read %>% 
  ggplot(aes(x= age, fill = gender)) +
  geom_bar()

# age chart distribution per gender
csv_table_read %>% 
  ggplot(aes(x= age)) +
  facet_grid(~gender) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# age box plot distribution per gender
csv_table_read %>% 
  ggplot(aes(x= gender, y=age)) +
  geom_boxplot()

# relationship and distribution between age, points, and gender
csv_table_read %>% 
  ggplot(aes(y = points, x = age, colour = gender)) +
  geom_point() +
  labs(title = "Distribution of points per age and gender")

# in this plot we can see the few older participants who pull the mean age up (compared to the median).

Gender analysis

Gender and points

# Distribution of the points per gender - histogram
csv_table_read %>% 
  ggplot(aes(x = points)) +
  geom_histogram() +
  facet_grid(~gender) +
  labs(title = "Histogram of points by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Distribution of the points per gender - boxplot
csv_table_read %>% 
  ggplot(aes(y = points, x = gender, colour = gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of points by Gender")

#QQ plot - points per gender
csv_table_read %>% 
  ggplot(aes(sample = points)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean points per gender - this is not significant
csv_table_read %>%               
  t.test(points ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  points by gender
## t = -1.1832, df = 107.84, p-value = 0.2393
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -3.0896905  0.7799502
## sample estimates:
## mean in group F mean in group M 
##        22.32727        23.48214

Gender and attitude

# attitude vs gender
csv_table_read %>% 
  ggplot(aes(x=gender, y= attitude)) +
  geom_boxplot()

# Type histogram
csv_table_read %>% 
  ggplot(aes(x = attitude)) +
  geom_histogram() +
  facet_grid(~ gender) +
  labs(title = "Histogram of attitude by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# QQ plot: attitude per gender
csv_table_read %>% 
  ggplot(aes(sample = attitude)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean attitude per gender - this is significant and shows a difference between F and M on attitude
csv_table_read %>%               
  t.test(attitude ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  attitude by gender
## t = -4.0932, df = 122.66, p-value = 7.657e-05
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.6718635 -0.2338508
## sample estimates:
## mean in group F mean in group M 
##        2.990000        3.442857

Gender and deep learning approach:

# deep learning approach vs gender
# We could do that for all approach of learning
# Type histogram
csv_table_read %>% 
  ggplot(aes(x = deep)) +
  geom_histogram() +
  facet_grid(~ gender) +
  labs(title = "Histogram of deep approach by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Type boxplot
csv_table_read %>% 
  ggplot(aes(y = deep, x = gender, fill = gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of deep Approach by Gender")

# QQ plot: deep per gender
csv_table_read %>% 
  ggplot(aes(sample = deep)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean deep per gender - this is not significant (p = 0.47), so there is no evidence of a difference between genders in the deep approach
csv_table_read %>%               
  t.test(deep ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  deep by gender
## t = -0.72082, df = 101.32, p-value = 0.4727
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.2546963  0.1189279
## sample estimates:
## mean in group F mean in group M 
##        3.656818        3.724702

Age Analysis

Age and points

# does not seem to impact on points
csv_table_read %>% 
  ggplot(aes(x= age, y=points)) +
  geom_point()+ 
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Age and attitude

# does not seem to impact on attitude
csv_table_read %>% 
  ggplot(aes(x= age, y=attitude)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Learning approach: deep analysis

deep and age

# deep learning approach vs age - no correlation
# We could do that for all approach of learning
csv_table_read %>% 
  ggplot(aes(x= age, y= deep)) +
  geom_point()  +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

deep and points

# deep learning approach vs points - the fitted lines hint at a relationship between the deep approach and points, but the model below shows it is not significant
csv_table_read %>% 
  ggplot(aes(x= deep, y=points)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

test_fit <- csv_table_read %>% 
  lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit)
## 
## Call:
## lm(formula = points ~ deep * gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3247  -3.3338   0.3369   4.6242  10.6787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.36387    3.91828   5.708 5.33e-08 ***
## deep         -0.01001    1.06032  -0.009    0.992    
## genderM       2.67476    6.41719   0.417    0.677    
## deep:genderM -0.40787    1.71487  -0.238    0.812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared:  0.00922,    Adjusted R-squared:  -0.009127 
## F-statistic: 0.5025 on 3 and 162 DF,  p-value: 0.6811
# conclusion: deep has no significant effect on points (all p values are large and R-squared is close to zero)
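library(ggfortify) is loaded in the chunk above but never used; its purpose here would be regression diagnostics. A minimal sketch (assuming the test_fit object fitted above):

```r
library(ggfortify)
# autoplot() turns the base-R lm diagnostics (residuals vs fitted,
# normal QQ, scale-location, leverage) into ggplot panels
autoplot(test_fit)
```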

Chapter 3

setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

library(broom)
library(ggplot2)
library(dplyr)
data <- read.csv("alc.csv",sep = ",", header = T)

class(data$high_use)
## [1] "logical"
class(data$famrel)
## [1] "integer"
class(data$goout)
## [1] "integer"
class(data$studytime)
## [1] "integer"

Question 3, 4 and 5

The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)

Hypothesis 1: students with a good family relationship are less likely to have a high consumption of alcohol

Comments on the code chunk below: There is a negative relationship between family relationship quality and alcohol consumption: the worse the relationship (scored from 1 to 5), the higher the risk of drinking a lot. The p value is 0.0204, which is significant (< 0.05). The coefficient for famrel in this model is -0.2838, so the odds ratio associated with a one-unit change in famrel is approximately exp(-0.2838) = 0.753: for a one-unit increase in famrel, the odds of high_use decrease by about 1 - 0.753 = 24.7%. With 95% confidence, the true odds ratio for the intercept lies between approximately 0.50 and 3.35. For famrel, the odds ratio lies between 0.59 and 0.96; since this interval does not include 1, the effect is statistically significant and negative. The null deviance is larger than the residual deviance, which shows that the model explains part of the variation in high_use.

# family = binomial because we model a binary variable (TRUE and FALSE)
logistic_model_family_alcohol <-  glm(high_use ~ famrel, data = data, family = binomial)
summary(logistic_model_family_alcohol)
## 
## Call:
## glm(formula = high_use ~ famrel, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.2573     0.4853   0.530   0.5960  
## famrel       -0.2838     0.1224  -2.318   0.0204 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 446.67  on 368  degrees of freedom
## AIC: 450.67
## 
## Number of Fisher Scoring iterations: 4
# exponentiate the coefficients to get the odds ratios
coef(logistic_model_family_alcohol) %>% exp()
## (Intercept)      famrel 
##   1.2934265   0.7528865
# get the intervals
confint(logistic_model_family_alcohol) %>% exp()
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.4961949 3.3547129
## famrel      0.5911258 0.9570208
# get the variance, BIC and AIC
logistic_model_family_alcohol %>% 
  glance()
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          452.     369  -223.  451.  458.     447.         368   370
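The separate coef(), exp() and confint() steps above can also be collapsed into a single broom call; a sketch (exponentiate and conf.int are standard arguments of tidy() for glm fits):

```r
library(broom)
# one table: odds ratios (exponentiated coefficients) together with
# profile-likelihood 95% confidence intervals
tidy(logistic_model_family_alcohol, exponentiate = TRUE, conf.int = TRUE)
```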
# To check the reverse direction (the impact of high_use on family relationship): famrel is ordinal, so I use an ordinal regression model
library(ordinal)
## 
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
## 
##     slice
ordinal_model_family_alcohol <- clm(factor(famrel) ~ high_use, data = data)
summary(ordinal_model_family_alcohol)
## formula: factor(famrel) ~ high_use
## data:    data
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  370  -454.70 919.40 5(0)  3.77e-08 2.0e+01
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## high_useTRUE  -0.5411     0.2139  -2.529   0.0114 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2  -4.0079     0.3672 -10.914
## 2|3  -2.7746     0.2194 -12.645
## 3|4  -1.3118     0.1422  -9.222
## 4|5   0.8468     0.1300   6.516
# This box plot supports our hypothesis and conclusion: the mean and median of family relationship quality are clearly different depending on whether the students drink a lot or not.
data %>% 
  ggplot(aes(x= factor(high_use), y= famrel)) +
  geom_boxplot() +
  labs(title = "Quality of family relationship vs the alcohol consumption",
       x = "High consumption",
       y = "quality of family relationship")

data %>%
  ggplot(aes(x = as.factor(famrel), fill = as.factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Family relationship and Alcohol Consumption",
    x = "Family relation score from 1 to 5",
    y = "Proportion of students",
    fill = "High use of alcohol: TRUE or FALSE"
  )

# there are fewer scores of 5 and far more scores of 2 among the students who drink a lot
data %>%
  ggplot(aes(x = as.factor(high_use), fill = as.factor(famrel))) +
  geom_bar(position = "fill") +
  labs(
    title = "Family relation score distribution by Alcohol Consumption",
    x = "High use of alcohol: TRUE or FALSE",
    y = "Proportion of students",
    fill = "Family relation score from 1 to 5"
  )

Hypothesis 2: students who go out more with their friends are more likely to have a high consumption of alcohol

Comments on the code chunk below: There is a positive relationship between going out and alcohol consumption, with a very small (significant) p value, which confirms the hypothesis: the more students go out, the more likely they are to have a high consumption of alcohol.

The coefficient for goout is 0.7563. The odds ratio associated with a one-unit change in goout is approximately exp(0.7563) = 2.13. For a one-unit increase in goout, the odds of high_use increase by a factor of 2.13, i.e. by about 113%.

With 95% confidence, the true odds ratio for the intercept lies between approximately 0.015 and 0.079. For goout, the odds ratio lies between 1.71 and 2.70; since this interval does not include 1, the effect is statistically significant and positive.

The null deviance is much larger than the residual deviance, which shows that the model explains high_use even better than the model in hypothesis 1. The AIC of this model is also smaller than the previous model's, which supports this conclusion.

logistic_model_goout_alcohol <- glm(high_use ~ goout, data = data, family = binomial)
summary(logistic_model_goout_alcohol)
## 
## Call:
## glm(formula = high_use ~ goout, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.3368     0.4188  -7.968 1.62e-15 ***
## goout         0.7563     0.1165   6.494 8.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 403.05  on 368  degrees of freedom
## AIC: 407.05
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_goout_alcohol)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -3.34      0.419     -7.97 1.62e-15
## 2 goout          0.756     0.116      6.49 8.34e-11
# exponentiate the coefficients to get the odds ratios
coef(logistic_model_goout_alcohol) %>% exp()
## (Intercept)       goout 
##  0.03554946  2.13048020
# get the intervals
confint(logistic_model_goout_alcohol) %>% exp()
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) 0.01515176 0.07850224
## goout       1.70566958 2.69518608
# plots
data %>% 
  ggplot(aes(x= factor(high_use), y= goout)) +
  geom_boxplot() +
  labs(title = "Going out with friends vs the alcohol consumption",
       x = "High consumption",
       y = "go out with friends (from 1 to 5)")

# students who drink a lot are more numerous among those who gave a score of 4 or 5 for the goout variable.
data %>%
  ggplot(aes(x = factor(goout), fill = factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Going Out Behavior and Alcohol Consumption",
    x = "going out with friends from 1 to 5",
    y = "Proportion of students",
    fill = "High use of alcohol: TRUE or FALSE"
  )

# scores of 1, 2 and 3 are more common among the students who do not drink a lot
data %>%
  ggplot(aes(x = factor(high_use), fill = factor(goout))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and going out behavior",
    x = "High use of alcohol: TRUE or FALSE",
    y = "Proportion of students",
    fill = "going out with friends from 1 to 5"
  )

Hypothesis 3: students who study a lot are less likely to have a high consumption of alcohol

Comments about the code chunk below: The p value is very small and the estimate is negative, so there is a negative relationship between study time and alcohol consumption: as study time increases, the odds of high alcohol consumption decrease.

The coefficient for studytime is -0.6208. The odds ratio associated with a one-unit change in studytime is approximately exp(-0.6208) = 0.538: for a one-unit increase in studytime, the odds of high_use decrease by about 1 - 0.538 = 46.2%.

With 95% confidence, the true odds ratio for the intercept lies between approximately 0.79 and 2.67. For studytime, the odds ratio lies between 0.39 and 0.72; since this interval does not include 1, the effect is statistically significant and negative.

The residual deviance is smaller than the null deviance, meaning the model explains part of the variation in high_use. The AIC is smaller than the first model's but not as small as the second model's.

logistic_model_studytime_alcohol <- glm(high_use ~ studytime, data = data, family = binomial)
summary(logistic_model_studytime_alcohol)
## 
## Call:
## glm(formula = high_use ~ studytime, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.3649     0.3102   1.176    0.239    
## studytime    -0.6208     0.1542  -4.027 5.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 433.82  on 368  degrees of freedom
## AIC: 437.82
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_studytime_alcohol)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    0.365     0.310      1.18 0.239    
## 2 studytime     -0.621     0.154     -4.03 0.0000566
# exponentiate the coefficients to get the odds ratios
coef(logistic_model_studytime_alcohol) %>% exp()
## (Intercept)   studytime 
##   1.4403849   0.5375316
# get the intervals
confint(logistic_model_studytime_alcohol) %>% exp()
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.7881099 2.6654451
## studytime   0.3933727 0.7208174
# This plot shows the difference in median and mean for the students who drink a lot or not
data %>% 
  ggplot(aes(x= factor(high_use), y= studytime)) +
  geom_boxplot() +
  labs(title = "Study time vs the alcohol consumption",
       x = "High consumption",
       y = "Study time (from 1 to 4)")

# high-use students are far more common among those who report a study time of 1 or 2
data %>%
  ggplot(aes(x = factor(studytime), fill = factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Study time Behavior and Alcohol Consumption",
    x = "Study time from 1 to 4",
    y = "Proportion of students",
    fill = "High use of alcohol: TRUE or FALSE"
  )

# far more scores of 1 and far fewer of 3 and 4 among students who drink a lot.
data %>%
  ggplot(aes(x = factor(high_use), fill = factor(studytime))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and Study time behavior",
    x = "High use of alcohol: TRUE or FALSE",
    y = "Proportion of students",
    fill = "Study time from 1 to 4"
  )

Hypothesis 4: students who chose their school for its reputation are less likely to have a high consumption of alcohol

Comments about the code below: The p value is 0.023, which indicates an association between the reason for choosing a school and alcohol consumption.

The test alone does not tell us which reason is chosen most by the students who drink a lot of alcohol versus those who don’t; some plots can help us identify it (see below).

# Creating a contingency table
chi_table_alcohol_reasons <- table((data$high_use), as.factor(data$reason))

# Performing the chi-squared test
chi_square_test <- chisq.test(chi_table_alcohol_reasons )
chi_square_test
## 
##  Pearson's Chi-squared test
## 
## data:  chi_table_alcohol_reasons
## X-squared = 9.563, df = 3, p-value = 0.02267
# Summary of the test
summary(chi_square_test)
##           Length Class  Mode     
## statistic 1      -none- numeric  
## parameter 1      -none- numeric  
## p.value   1      -none- numeric  
## method    1      -none- character
## data.name 1      -none- character
## observed  8      table  numeric  
## expected  8      -none- numeric  
## residuals 8      table  numeric  
## stdres    8      table  numeric
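To see which cells drive the association, the standardized residuals stored in the test object (listed as stdres in the summary above) can be inspected directly; a small sketch:

```r
# cells with an absolute standardized residual above roughly 2
# deviate most from what independence between the variables would predict
chi_square_test$stdres
```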
# This shows that students who don't drink a lot chose their school for its reputation more often than those who drink a lot. For the ones who drink, the reasons home and other are much more important than for the students who don't drink much. The reason courses seems equally important for both groups.
data %>%
  ggplot(aes(x = high_use, fill = reason)) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Reason to choose their school and Alcohol Consumption",
    x = "High use of alcohol: TRUE or FALSE",
    y = "Proportion of students",
    fill = "Reason to choose the school"
  )

data %>%
  ggplot(aes(x = reason, fill = high_use)) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and Reason to choose their school",
    x = "Reason to choose the school",
    y = "Proportion of students",
    fill = "High use of alcohol: TRUE or FALSE"
  )

Other models to test:

By adding the variables together, we can see that the variation in high_use is better explained (the residual deviance decreases from 452 to 379).

All three variables (goout, famrel, and studytime) have statistical significance in predicting alcohol consumption (high_use).

logistic_model_3var <- glm(high_use ~ goout + famrel + studytime, data = data, family = binomial)
logistic_model_3var_2 <- glm(high_use ~ goout * famrel * studytime, data = data, family = binomial)
summary(logistic_model_3var)
## 
## Call:
## glm(formula = high_use ~ goout + famrel + studytime, family = binomial, 
##     data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7286     0.6941  -1.050  0.29381    
## goout         0.7777     0.1203   6.467 9.99e-11 ***
## famrel       -0.3777     0.1367  -2.763  0.00573 ** 
## studytime    -0.6139     0.1672  -3.673  0.00024 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 379.74  on 366  degrees of freedom
## AIC: 387.74
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_3var)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.729     0.694     -1.05 2.94e- 1
## 2 goout          0.778     0.120      6.47 9.99e-11
## 3 famrel        -0.378     0.137     -2.76 5.73e- 3
## 4 studytime     -0.614     0.167     -3.67 2.40e- 4
# Anova conclusion: The p-value associated with the difference in deviance (Pr(>Chi)) is 0.1362, suggesting that the difference in fit between the two models is not statistically significant
anova(logistic_model_3var, logistic_model_3var_2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ goout * famrel * studytime
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       366     379.74                     
## 2       362     372.74  4    6.995   0.1362
# when comparing with, for instance, logistic_model_family_alcohol, the residual deviance of logistic_model_3var is better and the p value is significant.
anova(logistic_model_3var, logistic_model_family_alcohol, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ famrel
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       366     379.74                          
## 2       368     446.67 -2  -66.934 2.921e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the best model is logistic_model_3var according to AIC and BIC, since both metrics are smaller for this model
AIC(logistic_model_3var, logistic_model_3var_2)
##                       df      AIC
## logistic_model_3var    4 387.7372
## logistic_model_3var_2  8 388.7422
BIC(logistic_model_3var, logistic_model_3var_2)
##                       df      BIC
## logistic_model_3var    4 403.3912
## logistic_model_3var_2  8 420.0502

Question 6: Model 1 (high_use ~ goout + famrel + studytime)

Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

# prediction based on the model: logistic_model_3var
predictions <-  predict(logistic_model_3var, type = "response")

#Create the confusion matrix: 
table(data$high_use, predictions > 0.5)
##        
##         FALSE TRUE
##   FALSE   235   24
##   TRUE     63   48
# correct predictions: FALSE/FALSE + TRUE/TRUE = 235 + 48 = 283
# wrong predictions: FALSE/TRUE + TRUE/FALSE = 24 + 63 = 87

# proportion of misclassifications = 87/370 = 23.5%
# odds of misclassification = 87/283 = 31%

# precision: true positives among all positive predictions = 0.67
precision <- 48 / (48 + 24)
# recall: true positives among all actual positives = 0.43
recall <- 48 / (48 + 63)
# F1 score: 0.52
2 * (precision * recall) / (precision + recall)
## [1] 0.5245902
# Accuracy of always predicting the majority class: the majority class is FALSE, so predicting FALSE for every student gives a correct prediction rate of 70%
mean(data$high_use == FALSE)
## [1] 0.7
# The total proportion of errors is 23.5%, so my model does a bit better than the majority-class guess (70% accuracy, i.e. 30% error).
mean(data$high_use != (predictions > 0.5))
## [1] 0.2351351
# to conclude, my model is somewhat better than predicting the majority class, but it should be improved for better accuracy!
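The hard-coded error counts above can also be computed with a small helper function; a minimal sketch (the commented usage line assumes the data and predictions objects defined above):

```r
# proportion of predictions that fall on the wrong side of the 0.5 cutoff
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# sanity check on toy values: one prediction out of three is wrong
loss_func(class = c(1, 0, 1), prob = c(0.9, 0.2, 0.1))
## [1] 0.3333333

# with the model above:
# loss_func(class = data$high_use, prob = predictions)
```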

Bonus:

I had to rely on ChatGPT for this one, as I did not know the method. I now understand that it splits the data into 10 samples and uses 9 of them to predict the remaining one, rotating so that each sample is tested. Here, the average accuracy over the folds is 75.7%, which is better than the 70% of the majority class.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(lattice)
# Define the model formula
formula <- factor(high_use) ~ goout + famrel + studytime

# Define the number of folds for cross-validation
num_folds <- 10

# Create a training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = num_folds)

# Train the model using 10-fold cross-validation
model <- train(formula, data = data, method = "glm", family = binomial, trControl = train_control)

# Get cross-validated results
results <- model$resample
str(results)
## 'data.frame':    10 obs. of  3 variables:
##  $ Accuracy: num  0.784 0.711 0.757 0.73 0.757 ...
##  $ Kappa   : num  0.454 0.205 0.368 0.353 0.33 ...
##  $ Resample: chr  "Fold01" "Fold02" "Fold03" "Fold04" ...
# Calculate mean prediction error
mean(results$Accuracy)
## [1] 0.7570887
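An alternative to caret here is cv.glm() from the boot package, which returns the cross-validated prediction error directly; a sketch (assuming the data and logistic_model_3var objects from above):

```r
library(boot)
# cost function: proportion of misclassifications at the 0.5 cutoff
cost <- function(actual, prob) mean(abs(actual - prob) > 0.5)
cv <- cv.glm(data = data, cost = cost, glmfit = logistic_model_3var, K = 10)
cv$delta[1]  # 10-fold cross-validated misclassification rate
```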

Chapter 4

Question 1, 2 and 3

dataset description:

The Boston research seems to look for correlations between the standard of living, housing quality, and other criteria such as the urban environment, accessibility, and demographics within the Boston area.

Each row represents data for a specific town and has information about various factors such as crime rate, proportion of non-retail business acres, property tax rates, pupil-teacher ratios, etc.

This data frame contains the following columns:

  • crim - per capita crime rate by town.

  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus - proportion of non-retail business acres per town.

  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox - nitrogen oxides concentration (parts per 10 million).

  • rm - average number of rooms per dwelling.

  • age - proportion of owner-occupied units built prior to 1940.

  • dis - weighted mean of distances to five Boston employment centres.

  • rad - index of accessibility to radial highways.

  • tax - full-value property-tax rate per $10,000.

  • ptratio - pupil-teacher ratio by town.

  • black - 1000(Bk − 0.63)^2, where Bk is the proportion of Black residents by town.

  • lstat - lower status of the population (percent).

  • medv - median value of owner-occupied homes in $1000s.

Data analysis:

Explore the structure and the dimensions of the dataset. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, the distributions of the variables and the relationships between them.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)

setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

Boston <- Boston
# explore the mean, median, min, max per variable
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# 506 entries with 14 variables
dim(Boston)
## [1] 506  14
# Data type: all num or int
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# let's create a scatter plot for each pair of variables -> not easy to read with this many variables
pairs(Boston)

# Let's analyze the histogram of each variable. To use facet_wrap(), which draws one chart per key, we need a table with 2 columns: the key (the variable name) and the value (the metric).
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Let's compute a correlation matrix
matrix_correlation_var <- cor(Boston)

# Visualize correlation matrix as a heatmap
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
ggplot(data = melt(matrix_correlation_var), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0)

According to the matrix, there are negative relationships between certain variables:

lstat and medv - lower status of population and the median price of homes owned by occupants

lstat and rm - lower status of population and the average of room numbers per home

tax and medv - the percentage at which a property is taxed based on its assessed value and the median price of homes owned by occupants

dis and lstat - the weighted average distance from each house to these employment centers and the lower status of population

dis and age - the weighted average distance from each house to these employment centers and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940.

dis and nox - the weighted average distance from each house to these employment centers and nitrogen oxides concentration.

dis and indus - the weighted average distance from each house to these employment centers and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

tax and dis - the percentage at which a property is taxed based on its assessed value and the weighted average distance from each house to these employment centers

zn and age - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940

zn and nox - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the nitrogen oxides concentration.

zn and indus - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

Positive relationships between certain variables:

medv and rm - the median price of homes owned by occupants and the average of room numbers per home

tax and indus - the percentage at which a property is taxed based on its assessed value and the proportion of non-retail business acres per town.

tax and nox - the percentage at which a property is taxed based on its assessed value and the nitrogen oxides concentration.

age and indus - the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the proportion of non-retail business acres per town.

age and nox- the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the nitrogen oxides concentration.

nox and indus - the nitrogen oxides concentration and the proportion of non-retail business acres per town.

–> variables to test in my models: lstat, medv, rm, tax, dis, age, nox, indus, zn
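The pairs listed above were read off the heatmap by eye; a small sketch (my own addition, simply recomputing cor() on the same data) can also rank them programmatically:

```r
library(MASS)  # for the Boston data

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
cor_pairs <- na.omit(as.data.frame(as.table(cm)))
names(cor_pairs) <- c("var1", "var2", "correlation")

# 10 strongest relationships by absolute correlation
head(cor_pairs[order(-abs(cor_pairs$correlation)), ], 10)
```
This lists the same strong pairs (e.g. rad/tax, nox/dis, indus/nox) without having to eyeball the heatmap.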

Test some models based on the dataset and observed correlation

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
model1 <-  glm(lstat ~ medv + dis + rm,data=Boston)
model2 <-  glm(medv ~ rm + tax + lstat ,data=Boston)

# all the VIFs are between 1 and 2.1, which is ok. This suggests low multicollinearity and implies that the variance of the estimated coefficients is not significantly inflated by collinearity
vif(model1)
##     medv      dis       rm 
## 1.982257 1.068810 1.940168
vif(model2)
##       rm      tax    lstat 
## 1.610953 1.426005 2.092901
# let's exponentiate the coefficients and their 95% confidence intervals. Note: exp(coef) gives true odds ratios only for logistic regression; with a Gaussian glm like these, the exponentiated coefficients are just a monotone transformation of the effect sizes
OR1 <- coef(model1) %>% exp
OR2 <- coef(model2) %>% exp
CI1 <- confint(model1) %>% exp
## Waiting for profiling to be done...
CI2 <- confint(model2) %>% exp
## Waiting for profiling to be done...
# a confidence interval for exp(coef) that doesn't span 1 corresponds to a coefficient interval that doesn't span 0, i.e. a statistically significant effect in both models
cbind(OR1, CI1) 
##                      OR1        2.5 %       97.5 %
## (Intercept) 1.764803e+16 4.023272e+14 7.741285e+17
## medv        6.606608e-01 6.250484e-01 6.983022e-01
## dis         3.292580e-01 2.756488e-01 3.932934e-01
## rm          1.682751e-01 8.210665e-02 3.448749e-01
cbind(OR2, CI2)
##                     OR2        2.5 %      97.5 %
## (Intercept)   0.6073487  0.001289619 286.0319984
## rm          181.1868455 76.546116788 428.8744403
## tax           0.9935198  0.990167804   0.9968832
## lstat         0.5754723  0.522466602   0.6338557
# the residual deviance is way smaller than the null deviance. It indicates a reasonably good fit of the model to the data.
summary(model1)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.40940    1.92918  19.391  < 2e-16 ***
## medv        -0.41451    0.02827 -14.662  < 2e-16 ***
## dis         -1.11091    0.09067 -12.252  < 2e-16 ***
## rm          -1.78215    0.36612  -4.868 1.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.22406)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  8646.5  on 502  degrees of freedom
## AIC: 2882.2
## 
## Number of Fisher Scoring iterations: 2
# medv and rm also influence each other, so let's modify the model a bit
model11 <-  glm(lstat ~ medv + dis + rm + rm * medv ,data=Boston)
summary(model11)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm + rm * medv, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.41052    3.64466   20.42   <2e-16 ***
## medv        -1.94316    0.13517  -14.38   <2e-16 ***
## dis         -0.89569    0.08285  -10.81   <2e-16 ***
## rm          -7.55541    0.59817  -12.63   <2e-16 ***
## medv:rm      0.22526    0.01957   11.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 13.64918)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  6838.2  on 501  degrees of freedom
## AIC: 2765.5
## 
## Number of Fisher Scoring iterations: 2
# same for this one, residual deviance is way smaller than the null deviance. 
summary(model2)
## 
## Call:
## glm(formula = medv ~ rm + tax + lstat, data = Boston)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.498652   3.140239  -0.159 0.873895    
## rm           5.199529   0.439618  11.827  < 2e-16 ***
## tax         -0.006501   0.001724  -3.770 0.000182 ***
## lstat       -0.552564   0.049302 -11.208  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 29.90866)
## 
##     Null deviance: 42716  on 505  degrees of freedom
## Residual deviance: 15014  on 502  degrees of freedom
## AIC: 3161.4
## 
## Number of Fisher Scoring iterations: 2

Question 4

Standardizing the data: variables often have different scales and units, which makes direct comparisons difficult. Standardization brings all variables to a common scale, with a mean of 0 and a standard deviation of 1. This makes it easier to interpret coefficients and compare the relative importance of different predictors in regression. It also ensures that each variable contributes equally to the analysis, preventing one variable from dominating simply because of its scale. Finally, it can mitigate issues related to multicollinearity in regression by putting variables on a similar scale, reducing the impact of differing scales on the regression coefficients.

# select numerical values
Boston_numeric_cols <- Boston[, sapply(Boston, is.numeric)]

# scale() standardizes the data so that each variable has a mean of 0 and a standard deviation of 1.
scaled_boston <- scale(Boston_numeric_cols)

# convert to a data frame
scaled_table_boston <- as.data.frame(scaled_boston)

# how did the data change? The mean of every variable is now 0, so it has worked.
summary(scaled_table_boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
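As a quick sanity check (my own addition), scale() is just (x − mean(x)) / sd(x) applied column-wise, which we can verify on a single variable:

```r
library(MASS)  # for the Boston data

x <- Boston$rm
manual <- (x - mean(x)) / sd(x)

# scale() returns a matrix with attributes; drop them before comparing
all.equal(as.numeric(scale(x)), manual)  # TRUE
```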
# use the cut() function to create a categorical variable from intervals (breaks) in a numerical variable. We do this for crim using its quantiles (0, 0.25, 0.50, 0.75, 1) and add a label for each category.
# include.lowest = TRUE ensures there is no NA category.
quantiles <- quantile(Boston$crim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
interval_labels <- c("premier_quantil", "second_quantil", "third_quantil", "fourth_quantil")  

scaled_table_boston$quantiles_crime <- cut(Boston$crim, quantiles, labels= interval_labels,include.lowest = TRUE)
# Notes: quantiles derived from numeric values can be treated as categorical or continuous. When quantiles define discrete bins that divide a continuous variable into distinct groups, they are treated as categorical (e.g., small, medium, big). When they represent numeric positions within the distribution (e.g., the 25th percentile, median, 75th percentile), they are considered continuous.
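A tiny self-contained example (my own toy data, not from Boston) of what cut() with quantile breaks does:

```r
v <- c(1, 4, 6, 9, 15, 40, 80, 100)
breaks <- quantile(v, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE so the minimum value falls in the first bin instead of NA
quartile_cats <- cut(v, breaks = breaks,
                     labels = c("Q1", "Q2", "Q3", "Q4"),
                     include.lowest = TRUE)

table(quartile_cats)  # each label gets 2 of the 8 values
```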

# drop the former column crim and create a new table. 
Boston_new <- scaled_table_boston %>% dplyr::select(-crim)

# We need 80% of the total rows for the training set
train_size <- round(0.8 * nrow(Boston_new))

# Randomly sample 80% of the row indices for the training set
train_set <- sample(seq_len(nrow(Boston_new)), size = train_size)

# Create training and testing subsets
train_data <- Boston_new[train_set, ]
test_data <- Boston_new[-train_set, ]

Question 5:

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

Linear Discriminant Analysis (LDA) can be quite useful in real-life scenarios where classification or dimensionality reduction is required, such as marketing and customer segmentation, product quality control, and credit scoring.

It is ideal when we have multiple variables that can help us categorize each entry into one specific group, where the groups are assumed to share similar variances. The goal is to predict the categorical outcome or class based on a set of predictors.

LDA transforms the original variables into a smaller set of new variables, known as linear discriminants. These new variables are created in a way that maximizes the separation between the different categories or classes in the data. By using the information captured in the linear discriminants, LDA can accurately assign new observations (data points) to their respective classes. In short, LDA combines the original variables into new ones that highlight the differences between groups, and then uses this representation to classify new data based on what it has learned. This makes it useful both for simplifying data and for predicting categories in that data.

LDA calculates the mean and variance for each predictor within each class or category in the target variable. It finds coefficients for these combinations by considering differences in means between classes and assumes equal variances for each predictor within classes. It assumes that the data approximately follows a normal distribution.

# check the impact of each variable on a categorical variable (quantiles_crime)
# each quantile contains approximately 25% of the observations
# LD1 explains 96% of the between-group variance. When LD1 captures 96%, it suggests that LD1 alone effectively separates the classes in the data based on their mean differences. Along LD1, data points are grouped so that the separation between classes is maximized while the variation within each class is minimized.
# Positive coefficients indicate that higher values of that variable contribute to a higher score in that particular discriminant, while negative coefficients suggest the opposite. The larger the magnitude of the coefficient, the greater the impact of that variable on the discriminant.

library(MASS)

# Fit LDA on the train set - LDA is a statistical method used for classification. It fits an LDA model using the train_data, where quantiles_crime is the target variable to predict based on the other variables
lda_model <- lda(quantiles_crime ~ ., data = train_data)

# Predict on the test set by using the function predict to apply the model created from the train_data to test_data -> the idea here is to predict the class labels for this test_data
lda_pred <- predict(lda_model, test_data)$class # class extract the predicted class labels

# actual crime quantiles
actual_crime_categories <- test_data$quantiles_crime

# Extract LD scores from the LDA model's predictions
lda_scores <- predict(lda_model, test_data)$x # $x contains the linear discriminant scores for each observation in test_data (posterior probabilities are in $posterior)

# Create a data frame with the LD scores from the first 2 columns of lda_scores and the predicted classes, combining LD1, LD2 and the predictions for visualization. LD1 and LD2 are typically chosen because they capture most of the discriminating power while still allowing a clear two-dimensional plot.
plot_data <- data.frame(LD1 = lda_scores[, 1], LD2 = lda_scores[, 2], prediction_crime_quantile = as.factor(lda_pred))

plot_data
##             LD1         LD2 prediction_crime_quantile
## 13  -2.19495906  0.05903739            second_quantil
## 20  -1.98773503 -0.52094792            second_quantil
## 21  -1.49908173 -0.95224424             third_quantil
## 24  -1.56564250 -1.06178847             third_quantil
## 39  -2.21004108  0.15120955            second_quantil
## 46  -3.30338408  0.24096910            second_quantil
## 49  -2.22126948 -0.95126769            second_quantil
## 61  -0.93282684  0.90823135            second_quantil
## 64  -1.29234733  0.76811854            second_quantil
## 65  -3.12145993 -0.34896538            second_quantil
## 74  -3.29156028  0.47978327            second_quantil
## 82  -2.62520888  0.56228796           premier_quantil
## 86  -3.10163904 -0.24846952            second_quantil
## 92  -3.16892206 -0.43989685            second_quantil
## 93  -2.64192556  0.58322356           premier_quantil
## 98  -3.31473562 -1.09742740            second_quantil
## 101 -1.57223862 -0.86169268             third_quantil
## 106 -1.23809842 -0.72007431             third_quantil
## 107 -1.22462680 -0.72834052             third_quantil
## 118 -1.34645502 -0.54799558             third_quantil
## 133 -1.33540967 -2.01026803             third_quantil
## 138 -1.38233763 -1.86481790             third_quantil
## 148  0.06002818 -2.77823674             third_quantil
## 152 -0.24758231 -2.68662138             third_quantil
## 155 -0.65994332 -2.95090625             third_quantil
## 156 -0.47729143 -2.75626171             third_quantil
## 162 -1.38267474 -2.41761081             third_quantil
## 166 -1.28786330 -1.54996219             third_quantil
## 167 -1.34089136 -2.64123224             third_quantil
## 168 -1.32960771 -1.36894830             third_quantil
## 175 -1.99257815 -0.05827807            second_quantil
## 179 -2.01173128 -0.54740105            second_quantil
## 181 -2.56727400 -1.20727137             third_quantil
## 186 -2.52083834 -0.54620382            second_quantil
## 194 -4.41841995  2.04956409           premier_quantil
## 212 -2.02305826 -1.03071934            second_quantil
## 214 -2.64258917 -0.29272399            second_quantil
## 215 -2.17042542 -0.16114623            second_quantil
## 222 -0.68657531 -0.64490619            second_quantil
## 224 -0.78950217 -0.37098639             third_quantil
## 234 -0.82052112 -1.16624410             third_quantil
## 251 -2.12718479  1.24941152           premier_quantil
## 257 -3.22605746  2.43809473           premier_quantil
## 259 -1.30344843 -1.07330538             third_quantil
## 260 -1.39975619 -0.75311722             third_quantil
## 266 -1.61434309  0.14555608             third_quantil
## 267 -1.32953803 -0.84632984             third_quantil
## 270 -3.00405224  0.23746305            second_quantil
## 280 -2.70117177  0.82659369           premier_quantil
## 283 -2.74442403 -0.21435174            second_quantil
## 286 -4.37411298  1.85089945           premier_quantil
## 294 -3.29708997  0.08945746            second_quantil
## 297 -3.01949543 -0.54309744            second_quantil
## 301 -2.78704951  2.34421790           premier_quantil
## 308 -1.27478332  1.13087979           premier_quantil
## 309 -2.25953966 -0.86039721             third_quantil
## 320 -2.22669990 -0.69386006            second_quantil
## 322 -2.19314060 -0.20411626            second_quantil
## 325 -2.31673625 -0.10149749            second_quantil
## 330 -3.15016061  0.34245714            second_quantil
## 338 -1.97067647 -0.24500701            second_quantil
## 339 -2.20264295  0.07876816            second_quantil
## 340 -2.11654735  0.01513690            second_quantil
## 348 -2.94349176  2.76843746           premier_quantil
## 349 -2.98679641  2.63024741           premier_quantil
## 353 -3.15770834  2.18893191           premier_quantil
## 356 -2.81926788  2.81953007           premier_quantil
## 360  6.28475167 -0.21754982            fourth_quantil
## 366  6.59078774  0.59263634            fourth_quantil
## 367  6.51793327  0.24068984            fourth_quantil
## 375  6.92428804  0.27931281            fourth_quantil
## 376  5.73510806  0.22114940            fourth_quantil
## 378  6.00737853  0.16219149            fourth_quantil
## 380  6.05947758  0.35744068            fourth_quantil
## 386  6.48573099  0.28563022            fourth_quantil
## 390  6.28583386  0.33607651            fourth_quantil
## 402  6.00576993  0.33966215            fourth_quantil
## 403  6.13729073  0.14687884            fourth_quantil
## 410  6.33183637  0.03242766            fourth_quantil
## 411  6.21645960  0.95188971            fourth_quantil
## 415  7.22526153  0.24293928            fourth_quantil
## 422  6.24957653  0.17896627            fourth_quantil
## 437  6.63743599  0.04886889            fourth_quantil
## 439  7.03548322 -0.13812762            fourth_quantil
## 450  6.28310020 -0.01419889            fourth_quantil
## 452  6.16902774 -0.11278942            fourth_quantil
## 453  6.12856747  0.02544060            fourth_quantil
## 454  6.05873291 -0.35605841            fourth_quantil
## 470  5.36955703  1.15572044            fourth_quantil
## 471  5.59006042  0.65749835            fourth_quantil
## 477  5.77328416  0.42689040            fourth_quantil
## 482  5.06944873  0.96145356            fourth_quantil
## 486  5.13094364  1.00860904            fourth_quantil
## 489 -1.30568868 -1.87826668             third_quantil
## 493 -1.50897155 -1.95177955             third_quantil
## 498 -1.10881911 -0.61579729             third_quantil
## 499 -1.15127139 -0.60799991             third_quantil
## 500 -1.02168534 -0.54515001             third_quantil
## 501 -1.07827967 -0.68616699             third_quantil
## 503 -2.92111074 -1.18254661             third_quantil
## 506 -3.10533674 -0.90557694           premier_quantil
# Create a scatterplot of LD1 and LD2
plot_LDA <- ggplot(plot_data, aes(x = LD1, y = LD2, color = prediction_crime_quantile)) +
  geom_point() +
  labs(title = "LDA Biplot with Predicted Crime Quantiles")

plot_LDA

# adding real values - comparison of actual vs predicted values in test_data
realVsPred_plot <- plot_LDA + 
  geom_point(aes(color = actual_crime_categories), size = 4, alpha = 0.1) +
  labs(color = "Real Quantiles of Crime")

realVsPred_plot

# the accuracy of predictions using test data
accuracy <- mean(lda_pred == test_data$quantiles_crime)
print(paste("Accuracy of LDA model on test data:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy of LDA model on test data: 68.32 %"

Question 6:

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)

# save the crime data: 
actual_crime_categories <- test_data$quantiles_crime

# Remove the categorical crime variable from the test dataset
test_data_without_crime <- test_data %>% dplyr::select(-quantiles_crime)

# get the classes based on the model - this repeats what was done earlier with lda_pred, so I am a bit confused.
lda_pred_test <- predict(lda_model, test_data_without_crime)$class

# get the table of predicted vs actual - the results are identical between the two ways since I performed the same steps twice. I might have missed something in the instructions.
cross_tab <- table(Predicted = lda_pred_test, Actual = actual_crime_categories)
cross_tab
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              11              3             0              0
##   second_quantil               11             13             6              0
##   third_quantil                 2              9            20              0
##   fourth_quantil                0              0             1             25
cross_table <- table(Predicted = lda_pred, Actual = actual_crime_categories)
cross_table
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              11              3             0              0
##   second_quantil               11             13             6              0
##   third_quantil                 2              9            20              0
##   fourth_quantil                0              0             1             25
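Beyond eyeballing the cross tabulation, the diagonal can be turned into per-class performance numbers (a small sketch of my own, reusing the cross_tab computed above):

```r
# class-wise recall: correct predictions (the diagonal) divided by
# the actual number of observations in each class (the column sums)
round(diag(cross_tab) / colSums(cross_tab), 2)

# overall accuracy: diagonal sum over the total,
# which reproduces the 68.32 % accuracy computed earlier
sum(diag(cross_tab)) / sum(cross_tab)
```
This makes it explicit that the fourth quantile is classified almost perfectly while the first two quantiles are often confused with each other.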

Question 7:

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

# Reload the Boston dataset
Boston_load <- Boston

# Scale the Boston data
Boston_scaled <- scale(Boston_load)

# Calculate the distances between scaled data points:
distances <- dist(Boston_scaled)

# Visualize the distances using fviz_dist()
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_dist(distances, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# k-means: the algorithm groups observations into clusters based on their similarity, aiming to minimize the variation within clusters while maximizing the variation between clusters. We need to choose the number of clusters before running the k-means analysis.

# Elbow method to find the optimal number of clusters: calculate the Within-Cluster Sum of Squared errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off. In the plot of WSS versus k, this shows up as an elbow.
# The Squared Error for each point is the square of the distance of the point from its representation i.e. its predicted cluster center.
# The WSS score is the sum of these Squared Errors for all the points.
wss <- numeric(10)  # Initialize within-cluster sum of squares vector
for (i in 1:10) {
  kmeans_model <- kmeans(Boston_scaled, centers = i, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss  # Store within-cluster sum of squares
}

# Plotting the Elbow Method
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares", main = "Elbow Method")

# or we can use the built-in fviz_nbclust() function for the Elbow method
kmeans_model_optimal2 <- fviz_nbclust(Boston_scaled, kmeans, method = "wss")
# or we can use the silhouette method
fviz_nbclust(Boston_scaled, kmeans, method = "silhouette")

# Visualize clusters using pairs() - caution: at this point kmeans_model is still the last model from the loop above (k = 10); the final 2-cluster model is fitted below
pairs(Boston_scaled, col = kmeans_model$cluster)

k <- 2  # 2 seems to be the best option according to the Elbow Method and the silhouette method 

# k-means model:
kmeans_model <- kmeans(Boston_scaled, centers = k, nstart = 25)

cluster_assignments <- kmeans_model$cluster

# visualize the clusters with the fviz_cluster() function
fviz_cluster(kmeans_model, data = Boston_scaled)

library(GGally)

# Combine the scaled data with the cluster assignments (cluster 1 or 2)
clustered_data <- cbind(as.data.frame(Boston_scaled), Cluster = as.factor(cluster_assignments))

# Visualize clusters using ggpairs()
ggpairs(clustered_data, aes(color = Cluster))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Mean for each cluster and variable:
clustered_data %>%
  mutate(Cluster = kmeans_model$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 2 × 15
##   Cluster   crim     zn  indus     chas    nox     rm    age    dis    rad
##     <int>  <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1       1 -0.389  0.262 -0.615  0.00291 -0.582  0.245 -0.433  0.454 -0.583
## 2       2  0.724 -0.487  1.14  -0.00541  1.08  -0.455  0.805 -0.844  1.08 
## # ℹ 5 more variables: tax <dbl>, ptratio <dbl>, black <dbl>, lstat <dbl>,
## #   medv <dbl>
# keep only the predictor columns from the train set
model_predictors <- train_data %>% dplyr::select(-quantiles_crime)

# check the dimensions
dim(model_predictors)
## [1] 405  13
dim(lda_model$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_model$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ordinal':
## 
##     slice
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Chapter 5

Question 1

Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

Data import and wrangling

library(tibble)
library(readr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(tidyr)
library(GGally)

human_new <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# no NA
any(is.na(human_new))
## [1] FALSE
# analyze and summary of variables
str(human_new)
## spc_tbl_ [155 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country  : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ Edu2.FM  : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : num [1:155] 64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num [1:155] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Edu2.FM = col_double(),
##   ..   Labo.FM = col_double(),
##   ..   Life.Exp = col_double(),
##   ..   Edu.Exp = col_double(),
##   ..   GNI = col_double(),
##   ..   Mat.Mor = col_double(),
##   ..   Ado.Birth = col_double(),
##   ..   Parli.F = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(human_new)
##    Country             Edu2.FM          Labo.FM          Life.Exp    
##  Length:155         Min.   :0.1717   Min.   :0.1857   Min.   :49.00  
##  Class :character   1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30  
##  Mode  :character   Median :0.9375   Median :0.7535   Median :74.20  
##                     Mean   :0.8529   Mean   :0.7074   Mean   :71.65  
##                     3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25  
##                     Max.   :1.4967   Max.   :1.0380   Max.   :83.50  
##     Edu.Exp           GNI            Mat.Mor         Ado.Birth     
##  Min.   : 5.40   Min.   :   581   Min.   :   1.0   Min.   :  0.60  
##  1st Qu.:11.25   1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65  
##  Median :13.50   Median : 12040   Median :  49.0   Median : 33.60  
##  Mean   :13.18   Mean   : 17628   Mean   : 149.1   Mean   : 47.16  
##  3rd Qu.:15.20   3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95  
##  Max.   :20.20   Max.   :123124   Max.   :1100.0   Max.   :204.80  
##     Parli.F     
##  Min.   : 0.00  
##  1st Qu.:12.40  
##  Median :19.30  
##  Mean   :20.91  
##  3rd Qu.:27.95  
##  Max.   :57.50
# Country is as rownames
human_new <- column_to_rownames(human_new, "Country")

Graphical overview

# histogram per variable
pivot_longer(human_new, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# charts and correlations
ggpairs(human_new,lower=list(combo=wrap("facethist",binwidth=0.1)))

# correlation between variables
c <- cor(human_new)
corrplot(c, method="circle")

Comments on the distributions and the relationships between the variables

For the most part, the distributions of these variables deviate from a normal distribution.

Life expectancy, Education F/M, Education exp, and GNI are negatively correlated with Maternal mortality and Adolescent birth.

Education F/M is positively correlated with Life expectancy, Education exp, and GNI. Life expectancy is positively correlated with Education exp and GNI. Education exp is also positively correlated with GNI. Maternal mortality is positively correlated with Adolescent birth.

> These observed correlations appear coherent, highlighting logical relationships between the variables.
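The claims above can also be backed numerically. A small sketch for listing the strongest pairwise correlations in a numeric data frame, demonstrated on the built-in `mtcars` data (the same idiom applies to `human_new`; the helper name `strongest_cors` is mine, not part of any package):

```r
# List the n strongest pairwise correlations in a numeric data frame.
strongest_cors <- function(df, n = 5) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
  ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
  idx <- arrayInd(ord[seq_len(n)], dim(cm))     # linear index -> (row, col)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = cm[ord[seq_len(n)]])
}
strongest_cors(mtcars, n = 3)
```

This complements `corrplot()`: the plot shows the pattern, the table gives the exact coefficients behind statements like "Maternal mortality is positively correlated with Adolescent birth".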

Question 2

Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

Non-standardized PCA

# PCA
pca_human <- prcomp(human_new)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


# GNI has a much larger variance than the other variables and dominates the PCA. The magnitudes are too different across the metrics to compare them without scaling.
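The dominance of one large-variance variable is easy to demonstrate with `summary()` on a `prcomp` object, which also shows the variability captured by each component. A minimal sketch with synthetic data (the variable `big` mimics GNI's scale; not the course data itself):

```r
set.seed(1)
toy <- data.frame(small1 = rnorm(100),
                  small2 = rnorm(100),
                  big    = rnorm(100, sd = 1000))  # one variable on a huge scale

pr_raw    <- prcomp(toy)                 # unscaled: 'big' swallows PC1
pr_scaled <- prcomp(toy, scale. = TRUE)  # scaled: variance spreads across PCs

summary(pr_raw)      # PC1 captures essentially all the variance
summary(pr_scaled)   # proportions are now comparable between components
```

The same check on `pca_human` vs. the scaled PCA in the next question would show the same effect.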

Question 3 - standardized PCA

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

Scaled data & standardized PCA

# standardization of human_new
scaled_human <- scale(human_new)

# new PCA with scaled data
pca_human2 <- prcomp(scaled_human)

# draw a biplot of the PCA, adding captions that describe the phenomena
biplot(pca_human2, choices = 1:2,
       xlab = "PC1 - Country Socio-Economic Development",
       ylab = "PC2 - Female Empowerment and Socio-Economic Development by Women",
       main = "Principal Component Analysis of Socio-Economic Indicators: Mapping Country Development Factors and Female Empowerment")

Include captions and descriptions on PCA:

PC1 - Country Socio-Economic Development. This component shows multiple socio-economic factors contributing to overall country development. It includes on one side, variables like Gross National Income, Life Expectancy, Education, and on the other side, variables like Maternal mortality and adolescent birth rate. This axis reflects a broad aspect of a country’s overall progress and development.

PC2 - Female Empowerment and Socio-Economic Development by Women. This component specifically highlights aspects related to female empowerment and participation (in the labour force or in parliament), and their impact on socio-economic development.

Question 4 - interpret the PCA

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points):

In the first PCA, the GNI variable has much larger values and a much larger variance than the other variables. It dominates the PCA, which therefore cannot be interpreted properly: the magnitudes are too different across the metrics to compare them without scaling. After scaling the dataset, I ran the PCA again so the variables could be compared on an equal footing.

Analysis of the standardized PCA: the arrows aligned with PC1 provide insight into the relationships between the variables themselves, and the same holds for PC2. The direction of an arrow indicates how the variable is correlated with the principal components shown in the plot, and the length of the arrow represents the strength of the variable's contribution to those components: longer arrows indicate a stronger influence.
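The arrows in the biplot correspond to the columns of `$rotation` (the loadings), so the visual reading can be checked numerically. A sketch on the built-in `USArrests` data (for the course data, the same call on `pca_human2` applies):

```r
# The biplot arrows are drawn from the loadings in pca$rotation:
# the sign gives the arrow's direction along each PC, and larger
# absolute loadings correspond to longer arrows.
pca <- prcomp(USArrests, scale. = TRUE)
round(pca$rotation[, 1:2], 2)
```

Each loading column is a unit vector, so the entries can be compared directly within a component.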

On the left side of the chart, countries appear more developed, characterized by higher values in key development indicators such as Gross National Income, Life Expectancy at Birth, and Expected Years of Education. These variables have significant influence and appear to reinforce each other: as these metrics increase, the level of development tends to be higher. We find in this subgroup many developed Western and Asian countries (e.g. Switzerland, United States, Italy, Australia, Japan, Denmark, etc.). Some of them also have a high female-to-male labour force participation ratio, which indicates a better situation for women (e.g. Iceland, Norway, Finland, Sweden, Denmark). Conversely, countries like Qatar, Bahrain, UAE, and Saudi Arabia display a relatively good development level but low female labour force and parliament participation, which weighs on the overall development and freedom of women. Some countries in the middle or even on the right of PC1 do particularly well on the metrics of women's development and freedom, such as Bolivia, Namibia, and Burundi.

On the opposite side of the chart, variables such as Adolescent Birth Rate and Maternal Mortality Ratio portray countries facing more challenging situations: higher values in these variables correlate with less favorable conditions in a country. These countries, primarily from the Global South, such as the Central African Republic, Congo, and Burkina Faso, face lower overall development. Women's situations also vary based on their positioning along PC2, reflecting the diverse and unequal experiences of women in different countries (Afghanistan appears to have a worse situation for women than Zimbabwe, for instance).

Question 5

Tea data import and exploration

The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perception of the product is (12 questions).

In addition, some personal details were asked (4 questions). Load the tea dataset and convert its character variables to factors: tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.

tea_new <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# View and summary of the dataset
View(tea_new)
str(tea_new)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_new)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
# variables are factors, except a few. Let's map the different categories with an MCA.

# But first, let's keep only the categorical variables. An alternative would have been to convert the non-categorical variables to categorical ones.
tea_new_2 <- tea_new %>% dplyr::select(-breakfast,-age)
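The alternative mentioned above, converting a numeric variable to a categorical one so it can enter the MCA as a factor, can be sketched with `cut()`. Here I bin `age` into the same groups as the ready-made `age_Q` column (the `age` values are the first rows shown in the `str()` output above):

```r
# Discretize a numeric variable into factor levels with cut().
age <- c(39, 45, 47, 23, 48, 21, 37, 36, 40, 37)   # first rows of tea_new$age
age_cat <- cut(age,
               breaks = c(14, 24, 34, 44, 59, Inf),          # (14,24], (24,34], ...
               labels = c("15-24", "25-34", "35-44", "45-59", "+60"))
table(age_cat)
```

On the full dataset, `tea_new$age_Q` already provides exactly this kind of binned version, which is why dropping `age` loses no categorical information.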

Multiple Correspondence Analysis (MCA)

On the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

library(FactoMineR)

# MCA
mca <- MCA(tea_new_2, graph = FALSE)
mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 34 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$var$eta2"       "coord. of variables"             
## 8  "$ind"            "results for the individuals"     
## 9  "$ind$coord"      "coord. for the individuals"      
## 10 "$ind$cos2"       "cos2 for the individuals"        
## 11 "$ind$contrib"    "contributions of the individuals"
## 12 "$call"           "intermediate results"            
## 13 "$call$marge.col" "weights of columns"              
## 14 "$call$marge.li"  "weights of rows"
# summary of the MCA
summary(mca)
## 
## Call:
## MCA(X = tea_new_2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.092   0.084   0.072   0.060   0.057   0.055   0.050
## % of var.              5.913   5.392   4.629   3.881   3.672   3.502   3.188
## Cumulative % of var.   5.913  11.305  15.934  19.815  23.488  26.989  30.177
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.049   0.048   0.045   0.042   0.042   0.039   0.038
## % of var.              3.138   3.050   2.885   2.675   2.672   2.491   2.424
## Cumulative % of var.  33.315  36.365  39.251  41.926  44.598  47.089  49.513
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.036   0.036   0.034   0.032   0.032   0.031   0.030
## % of var.              2.320   2.305   2.173   2.077   2.049   2.016   1.944
## Cumulative % of var.  51.833  54.138  56.311  58.387  60.436  62.453  64.397
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.029   0.027   0.026   0.026   0.025   0.024   0.024
## % of var.              1.880   1.761   1.692   1.669   1.624   1.553   1.544
## Cumulative % of var.  66.276  68.038  69.730  71.399  73.023  74.577  76.120
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.023   0.022   0.022   0.021   0.020   0.019   0.019
## % of var.              1.456   1.426   1.404   1.339   1.264   1.246   1.221
## Cumulative % of var.  77.576  79.002  80.406  81.745  83.009  84.255  85.476
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.017   0.017   0.017   0.016   0.016   0.015   0.015
## % of var.              1.120   1.112   1.078   1.028   1.005   0.968   0.939
## Cumulative % of var.  86.596  87.708  88.786  89.814  90.819  91.786  92.725
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               0.013   0.013   0.012   0.012   0.011   0.010   0.010
## % of var.              0.866   0.837   0.802   0.739   0.697   0.674   0.666
## Cumulative % of var.  93.590  94.427  95.229  95.968  96.666  97.339  98.005
##                       Dim.50  Dim.51  Dim.52  Dim.53
## Variance               0.009   0.008   0.007   0.006
## % of var.              0.603   0.531   0.458   0.403
## Cumulative % of var.  98.608  99.139  99.597 100.000
## 
## Individuals (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1            | -0.629  1.432  0.202 |  0.148  0.086  0.011 |  0.109  0.055
## 2            | -0.427  0.660  0.139 |  0.288  0.330  0.064 | -0.110  0.056
## 3            |  0.097  0.034  0.006 | -0.156  0.096  0.015 |  0.125  0.073
## 4            | -0.560  1.133  0.227 | -0.279  0.308  0.056 | -0.026  0.003
## 5            | -0.173  0.108  0.028 | -0.149  0.088  0.020 |  0.027  0.003
## 6            | -0.681  1.675  0.272 | -0.293  0.340  0.050 | -0.008  0.000
## 7            | -0.223  0.181  0.037 |  0.015  0.001  0.000 |  0.183  0.155
## 8            | -0.031  0.003  0.001 |  0.111  0.048  0.009 | -0.097  0.043
## 9            | -0.063  0.014  0.003 |  0.266  0.281  0.048 |  0.388  0.697
## 10           |  0.177  0.114  0.021 |  0.369  0.539  0.090 |  0.315  0.459
##                cos2  
## 1             0.006 |
## 2             0.009 |
## 3             0.009 |
## 4             0.000 |
## 5             0.001 |
## 6             0.000 |
## 7             0.025 |
## 8             0.007 |
## 9             0.103 |
## 10            0.066 |
## 
## Categories (the 10 first)
##                 Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## Not.tea time | -0.552  4.238  0.236 -8.396 |  0.001  0.000  0.000  0.019 |
## tea time     |  0.428  3.285  0.236  8.396 | -0.001  0.000  0.000 -0.019 |
## evening      |  0.296  0.961  0.046  3.704 | -0.404  1.961  0.085 -5.051 |
## Not.evening  | -0.155  0.503  0.046 -3.704 |  0.211  1.025  0.085  5.051 |
## lunch        |  0.590  1.630  0.060  4.231 | -0.406  0.844  0.028 -2.907 |
## Not.lunch    | -0.101  0.280  0.060 -4.231 |  0.070  0.145  0.028  2.907 |
## dinner       | -1.047  2.447  0.082 -4.965 | -0.080  0.016  0.000 -0.379 |
## Not.dinner   |  0.079  0.184  0.082  4.965 |  0.006  0.001  0.000  0.379 |
## always       |  0.342  1.281  0.061  4.276 | -0.255  0.784  0.034 -3.193 |
## Not.always   | -0.179  0.670  0.061 -4.276 |  0.134  0.410  0.034  3.193 |
##               Dim.3    ctr   cos2 v.test  
## Not.tea time  0.063  0.071  0.003  0.959 |
## tea time     -0.049  0.055  0.003 -0.959 |
## evening       0.324  1.469  0.055  4.052 |
## Not.evening  -0.169  0.768  0.055 -4.052 |
## lunch         0.254  0.386  0.011  1.822 |
## Not.lunch    -0.044  0.066  0.011 -1.822 |
## dinner        0.744  1.581  0.042  3.531 |
## Not.dinner   -0.056  0.119  0.042 -3.531 |
## always        0.095  0.126  0.005  1.186 |
## Not.always   -0.050  0.066  0.005 -1.186 |
## 
## Categorical variables (eta2)
##                Dim.1 Dim.2 Dim.3  
## tea.time     | 0.236 0.000 0.003 |
## evening      | 0.046 0.085 0.055 |
## lunch        | 0.060 0.028 0.011 |
## dinner       | 0.082 0.000 0.042 |
## always       | 0.061 0.034 0.005 |
## home         | 0.013 0.002 0.027 |
## work         | 0.071 0.020 0.027 |
## tearoom      | 0.326 0.021 0.027 |
## friends      | 0.201 0.058 0.024 |
## resto        | 0.155 0.007 0.030 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",col.var = rainbow(ncol(tea_new_2)))

# This lets us visualize many categories from many categorical variables at once, to see which categories fit together and suggest association (similarities in behavior). For instance, "unpackaged" and "tea shop" appear close to each other and separated from the rest of the categories, suggesting they share similar patterns of user preferences and tend to occur together within the dataset.

# Select a smaller dataset to run the MCA analysis with fewer variables.
# Let's now choose only a few variables that could be correlated but also show distinct behavior between users.
tea_new_2 %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ name, scales = "free") +
  geom_bar() +
  labs(title = "Distribution of Categorical Variables") + 
  theme(axis.text.x = element_text(angle = 25, hjust = 1, size = 10)) +
  theme(strip.text = element_text(size = 10))

# smaller data sets - I will look at 2 different ones.
tea_smaller <- tea_new %>% dplyr::select(dinner, effect.on.health, frequency, evening, home, healthy, how, lunch, pub, price,Tea, where, work, sex)

tea_smaller2 <- tea_new %>% dplyr::select(diuretic, feminine, healthy, exciting, iron.absorption, escape.exoticism, relaxing, sophisticated, sex, slimming)

# MCA models
mca2 <- MCA(tea_smaller, graph = FALSE)
mca3 <- MCA(tea_smaller2, graph = FALSE)

# summary of the models
summary(mca2)
## 
## Call:
## MCA(X = tea_smaller, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.154   0.151   0.109   0.100   0.092   0.088   0.081
## % of var.              9.395   9.176   6.648   6.066   5.627   5.331   4.944
## Cumulative % of var.   9.395  18.572  25.220  31.286  36.914  42.245  47.189
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.081   0.075   0.071   0.068   0.064   0.063   0.061
## % of var.              4.911   4.595   4.306   4.129   3.887   3.820   3.690
## Cumulative % of var.  52.100  56.695  61.000  65.129  69.016  72.836  76.526
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.059   0.054   0.050   0.046   0.045   0.040   0.035
## % of var.              3.589   3.297   3.035   2.806   2.766   2.413   2.110
## Cumulative % of var.  80.114  83.411  86.446  89.252  92.018  94.431  96.542
##                       Dim.22  Dim.23
## Variance               0.033   0.024
## % of var.              2.020   1.438
## Cumulative % of var.  98.562 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   | -0.251  0.136  0.026 |  0.542  0.649  0.120 | -0.166
## 2                   | -0.235  0.120  0.070 |  0.139  0.042  0.024 | -0.231
## 3                   | -0.181  0.070  0.019 | -0.077  0.013  0.003 | -0.043
## 4                   | -0.158  0.054  0.015 |  0.447  0.442  0.124 | -0.200
## 5                   | -0.107  0.025  0.013 |  0.020  0.001  0.000 |  0.383
## 6                   | -0.266  0.153  0.029 |  0.673  1.001  0.185 | -0.216
## 7                   | -0.175  0.066  0.029 |  0.098  0.021  0.009 |  0.148
## 8                   | -0.125  0.034  0.014 |  0.139  0.042  0.017 |  0.213
## 9                   |  0.447  0.431  0.163 | -0.239  0.126  0.046 |  0.076
## 10                  |  0.559  0.675  0.229 | -0.326  0.234  0.077 | -0.156
##                        ctr   cos2  
## 1                    0.085  0.011 |
## 2                    0.164  0.068 |
## 3                    0.006  0.001 |
## 4                    0.122  0.025 |
## 5                    0.447  0.173 |
## 6                    0.142  0.019 |
## 7                    0.067  0.021 |
## 8                    0.139  0.039 |
## 9                    0.018  0.005 |
## 10                   0.075  0.018 |
## 
## Categories (the 10 first)
##                        Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## dinner              |  0.590  1.129  0.026  2.801 |  1.055  3.690  0.084  5.004
## Not.dinner          | -0.044  0.085  0.026 -2.801 | -0.079  0.278  0.084 -5.004
## effect on health    |  0.315  1.012  0.028  2.896 |  0.415  1.793  0.049  3.808
## No.effect on health | -0.089  0.286  0.028 -2.896 | -0.117  0.506  0.049 -3.808
## +2/day              |  0.138  0.373  0.014  2.044 | -0.578  6.698  0.245 -8.562
## 1 to 2/week         |  0.096  0.062  0.002  0.687 |  0.889  5.494  0.136  6.374
## 1/day               | -0.311  1.416  0.045 -3.659 |  0.467  3.277  0.101  5.502
## 3 to 6/week         |  0.229  0.275  0.007  1.416 | -0.298  0.477  0.011 -1.842
## evening             |  0.132  0.276  0.009  1.648 | -0.277  1.246  0.040 -3.460
## Not.evening         | -0.069  0.144  0.009 -1.648 |  0.145  0.651  0.040  3.460
##                        Dim.3    ctr   cos2 v.test  
## dinner              | -0.266  0.323  0.005 -1.260 |
## Not.dinner          |  0.020  0.024  0.005  1.260 |
## effect on health    |  0.787  8.908  0.175  7.226 |
## No.effect on health | -0.222  2.513  0.175 -7.226 |
## +2/day              | -0.225  1.405  0.037 -3.338 |
## 1 to 2/week         |  0.948  8.623  0.155  6.797 |
## 1/day               | -0.451  4.212  0.094 -5.309 |
## 3 to 6/week         |  0.875  5.669  0.098  5.407 |
## evening             |  0.433  4.202  0.098  5.409 |
## Not.evening         | -0.226  2.197  0.098 -5.409 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## dinner              | 0.026 0.084 0.005 |
## effect.on.health    | 0.028 0.049 0.175 |
## frequency           | 0.046 0.337 0.304 |
## evening             | 0.009 0.040 0.098 |
## home                | 0.001 0.023 0.035 |
## healthy             | 0.015 0.016 0.489 |
## how                 | 0.587 0.297 0.002 |
## lunch               | 0.000 0.152 0.049 |
## pub                 | 0.000 0.167 0.008 |
## price               | 0.617 0.224 0.148 |
summary(mca3)
## 
## Call:
## MCA(X = tea_smaller2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.172   0.133   0.119   0.113   0.101   0.094   0.079
## % of var.             17.193  13.350  11.935  11.267  10.062   9.371   7.921
## Cumulative % of var.  17.193  30.543  42.478  53.744  63.806  73.177  81.098
##                        Dim.8   Dim.9  Dim.10
## Variance               0.070   0.064   0.055
## % of var.              7.014   6.369   5.519
## Cumulative % of var.  88.112  94.481 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   | -0.705  0.964  0.496 | -0.036  0.003  0.001 |  0.267
## 2                   | -0.061  0.007  0.004 |  0.510  0.649  0.266 |  0.119
## 3                   | -0.160  0.050  0.034 | -0.400  0.400  0.213 |  0.317
## 4                   | -0.333  0.214  0.157 | -0.289  0.208  0.118 |  0.218
## 5                   | -0.512  0.509  0.251 | -0.050  0.006  0.002 |  0.263
## 6                   | -0.666  0.860  0.494 | -0.410  0.420  0.187 |  0.415
## 7                   | -0.666  0.860  0.494 | -0.410  0.420  0.187 |  0.415
## 8                   |  0.165  0.053  0.041 | -0.600  0.899  0.542 | -0.272
## 9                   | -0.636  0.783  0.462 | -0.272  0.185  0.085 | -0.150
## 10                  | -0.431  0.361  0.224 | -0.253  0.159  0.077 |  0.606
##                        ctr   cos2  
## 1                    0.198  0.071 |
## 2                    0.039  0.014 |
## 3                    0.281  0.133 |
## 4                    0.133  0.067 |
## 5                    0.193  0.066 |
## 6                    0.481  0.192 |
## 7                    0.481  0.192 |
## 8                    0.207  0.112 |
## 9                    0.063  0.026 |
## 10                   1.025  0.441 |
## 
## Categories (the 10 first)
##                         Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## diuretic            |   0.409   5.643   0.231   8.311 |   0.241   2.533   0.081
## Not.diuretic        |  -0.565   7.792   0.231  -8.311 |  -0.333   3.498   0.081
## feminine            |   0.799  15.957   0.481  11.997 |  -0.147   0.694   0.016
## Not.feminine        |  -0.603  12.038   0.481 -11.997 |   0.111   0.524   0.016
## healthy             |   0.239   2.323   0.133   6.309 |  -0.120   0.757   0.034
## Not.healthy         |  -0.557   5.421   0.133  -6.309 |   0.280   1.766   0.034
## exciting            |   0.069   0.108   0.003   0.951 |   0.993  28.589   0.622
## No.exciting         |  -0.044   0.068   0.003  -0.951 |  -0.626  18.023   0.622
## iron absorption     |   0.767   3.536   0.068   4.503 |   0.130   0.130   0.002
## Not.iron absorption |  -0.088   0.407   0.068  -4.503 |  -0.015   0.015   0.002
##                      v.test     Dim.3     ctr    cos2  v.test  
## diuretic              4.906 |   0.277   3.731   0.106   5.630 |
## Not.diuretic         -4.906 |  -0.383   5.153   0.106  -5.630 |
## feminine             -2.205 |  -0.373   5.022   0.105  -5.607 |
## Not.feminine          2.205 |   0.282   3.789   0.105   5.607 |
## healthy              -3.173 |   0.369   7.977   0.317   9.741 |
## Not.healthy           3.173 |  -0.861  18.614   0.317  -9.741 |
## exciting             13.640 |  -0.133   0.575   0.011  -1.829 |
## No.exciting         -13.640 |   0.084   0.363   0.011   1.829 |
## iron absorption       0.762 |  -0.065   0.036   0.000  -0.379 |
## Not.iron absorption  -0.762 |   0.007   0.004   0.000   0.379 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## diuretic            | 0.231 0.081 0.106 |
## feminine            | 0.481 0.016 0.105 |
## healthy             | 0.133 0.034 0.317 |
## exciting            | 0.003 0.622 0.011 |
## iron.absorption     | 0.068 0.002 0.000 |
## escape.exoticism    | 0.053 0.029 0.000 |
## relaxing            | 0.006 0.438 0.062 |
## sophisticated       | 0.173 0.002 0.106 |
## sex                 | 0.304 0.070 0.240 |
## slimming            | 0.267 0.041 0.245 |
# visualize the MCAs
plot(mca2, invisible = c("ind"), graph.type = "classic", col.var = rainbow(ncol(tea_new_2)))

plot(mca3, invisible = c("ind"), graph.type = "classic", col.var = rainbow(ncol(tea_smaller2)))
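As an alternative to the base `plot` method, the `factoextra` package (an assumption here, since it is not used elsewhere in this report) provides ggplot2-based MCA maps that are often easier to read:

```r
# A sketch assuming the factoextra package is installed
library(factoextra)

fviz_mca_var(mca2, repel = TRUE,   # repel = TRUE reduces label overlap
             col.var = "cos2",     # color categories by quality of representation
             gradient.cols = c("blue", "orange", "red"))
```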

Conclusion of MCAs

MCA 2: The categories “tea shop,” “upscale price,” and “unpackaged” appear closely related, indicating they collectively represent a particular type of behavior or consumer preference. Similarly, “tea bag” and “chain store” seem to indicate another pattern within the dataset.

MCA 3: Categories associated with tea benefits, such as “iron absorption,” “healthy,” “sophisticated,” and “diuretic,” form a cluster, suggesting they share similar characteristics or appeal to a specific segment of consumers. In contrast, the categories for tea without special benefits stand apart, indicating a different category or behavior. Moreover, “exciting” aligns with “not relaxing,” and “feminine tea” with “female,” matching their expected conceptual correspondence in this context.


Chapter 5

Data wrangling

Load the data sets (BPRS and RATS) and analyze them

library(tidyverse)
library(tidyr)
library(ggplot2)
# Save the URLs in objects

url_data1 <- "https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt"

url_data2 <- "https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt"

# read the data and save it as an object
BPRS <- read.table(url_data1, header = TRUE, sep = ' ')
RATS <- read.table(url_data2, sep ='\t', header = T)

Check the variable names, view the data contents and structures, and create some brief summaries of the variables, so that you understand the point of the wide form data. (1 point)

#BPRS dataset
str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
glimpse(BPRS)
## Rows: 40
## Columns: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68, …
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65, …
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49, …
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36, …
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32, …
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27, …
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30, …
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37, …
dim(BPRS)
## [1] 40 11
summary(BPRS)
##    treatment      subject          week0           week1           week2     
##  Min.   :1.0   Min.   : 1.00   Min.   :24.00   Min.   :23.00   Min.   :26.0  
##  1st Qu.:1.0   1st Qu.: 5.75   1st Qu.:38.00   1st Qu.:35.00   1st Qu.:32.0  
##  Median :1.5   Median :10.50   Median :46.00   Median :41.00   Median :38.0  
##  Mean   :1.5   Mean   :10.50   Mean   :48.00   Mean   :46.33   Mean   :41.7  
##  3rd Qu.:2.0   3rd Qu.:15.25   3rd Qu.:58.25   3rd Qu.:54.25   3rd Qu.:49.0  
##  Max.   :2.0   Max.   :20.00   Max.   :78.00   Max.   :95.00   Max.   :75.0  
##      week3           week4           week5           week6           week7     
##  Min.   :24.00   Min.   :20.00   Min.   :20.00   Min.   :19.00   Min.   :18.0  
##  1st Qu.:29.75   1st Qu.:28.00   1st Qu.:26.00   1st Qu.:22.75   1st Qu.:23.0  
##  Median :36.50   Median :34.50   Median :30.50   Median :28.50   Median :30.0  
##  Mean   :39.15   Mean   :36.35   Mean   :32.55   Mean   :31.23   Mean   :32.2  
##  3rd Qu.:44.50   3rd Qu.:43.00   3rd Qu.:38.00   3rd Qu.:37.00   3rd Qu.:38.0  
##  Max.   :76.00   Max.   :66.00   Max.   :64.00   Max.   :64.00   Max.   :62.0  
##      week8      
##  Min.   :20.00  
##  1st Qu.:22.75  
##  Median :28.00  
##  Mean   :31.43  
##  3rd Qu.:35.25  
##  Max.   :75.00

The data shows BPRS measurements for 20 participants. There are no explicit groups for the participants, but there are 2 types of treatment, which we can consider as 2 groups. The maximum BPRS over the weeks is 95 and the minimum is 18. The BPRS distribution changes every week in terms of min, max, median and mean. The variables treatment and subject are integers -> they need to be converted to factors for further analyses. We have 40 rows and 11 columns -> the format is wide -> it needs to be changed to long format.

#RATS
str(RATS)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470,…
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465,…
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475,…
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485,…
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487,…
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493,…
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493,…
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504,…
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507,…
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518,…
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525,…
dim(RATS)
## [1] 16 13
summary(RATS)
##        ID            Group           WD1             WD8             WD15      
##  Min.   : 1.00   Min.   :1.00   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  1st Qu.: 4.75   1st Qu.:1.00   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  Median : 8.50   Median :1.50   Median :340.0   Median :345.0   Median :347.5  
##  Mean   : 8.50   Mean   :1.75   Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  3rd Qu.:12.25   3rd Qu.:2.25   3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  Max.   :16.00   Max.   :3.00   Max.   :555.0   Max.   :560.0   Max.   :565.0  
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0

The data shows the weight of 16 rats at different times from day 1 to day 64. Each rat belongs to one of 3 groups. The maximum weight over the period is 628 and the minimum is 225. The weight distribution changes at every time point in terms of min, max, median and mean. The variables ID and Group are integers -> they need to be converted to factors for further analyses. There are 16 rows and 13 columns -> the format is wide and needs to be changed to long format.

2. Convert the categorical variables to factors

#BPRS
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

# RATS
RATS$ID <- factor(RATS$ID )
RATS$Group <- factor(RATS$Group)

3. Convert the table to long form

Add a week variable to BPRS and a Time variable to RATS: pivot the data, reorder it by period of time (week or WD), then add one column where the week or WD number is stored as an integer.

BPRS

BPRS_long <- pivot_longer(BPRS, cols = -c(treatment, subject), # these columns are not pivoted; they are kept as identifiers
                       names_to = "weeks", values_to = "bprs") %>% # names_to: new column for the former column titles; values_to: new column for the values
          arrange(weeks) %>% # order by the weeks variable
            mutate(week = as.integer(substr(weeks, 5, 5))) # extract the digit from "week0".."week8"

RATS

RATS_long <- pivot_longer(RATS, cols = -c(ID, Group), 
                       names_to = "WD", values_to = "weight") %>% 
  arrange(WD) %>% 
  mutate(Time = as.integer(substr(WD,3,4)))
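A note on the `substr` calls above: they assume the numeric part sits at fixed positions in the column names (`week0`…`week8`, `WD1`…`WD64`). A more general alternative, sketched here, is `readr::parse_number()`, which extracts the numeric part regardless of its width:

```r
# A sketch: readr::parse_number() pulls the number out of a string,
# however many digits follow the prefix
library(readr)

as.integer(parse_number("week8"))  # 8
as.integer(parse_number("WD64"))   # 64
```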

Compare wide vs long format

Now, take a serious look at the new data sets and compare them with their wide form versions: Check the variable names, view the data contents and structures, and create some brief summaries of the variables. Make sure that you understand the point of the long form data and the crucial difference between the wide and the long forms before proceeding to the Analysis exercise. (2 points)

Long format data allow for a more detailed examination of individual measurements, making it easier to assess the impact of various factors on the outcome. This format also facilitates the application of statistical methods designed for longitudinal or repeated measures analysis.

BPRS

str(BPRS_long)
## tibble [360 × 5] (S3: tbl_df/tbl/data.frame)
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr [1:360] "week0" "week0" "week0" "week0" ...
##  $ bprs     : int [1:360] 42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int [1:360] 0 0 0 0 0 0 0 0 0 0 ...
glimpse(BPRS_long)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
dim(BPRS_long)
## [1] 360   5
summary(BPRS_long)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252

The main difference in the long table is that there is now one row per measurement, identified by week, subject and treatment type, as if each entry were an independent observation. We can now study the relationship between the treatment type, the week, the subject and the BPRS score. We also get a single mean and median for the whole study. Treatment and subject are now factor variables, which represents the categorical information in a format that can be used effectively in statistical analyses. Week is now an integer, which allows mathematical operations and analyses involving this variable.

RATS

str(RATS_long)
## tibble [176 × 5] (S3: tbl_df/tbl/data.frame)
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
##  $ weight: int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
glimpse(RATS_long)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 15, 15, 15, 15,…
dim(RATS_long)
## [1] 176   5
summary(RATS_long)
##        ID      Group       WD                weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

The main difference in the long table is that there is now one row per measurement, identified by time, rat and group, as if each entry were an independent observation. We can now run analyses with a single outcome, the rat’s weight, influenced by variables such as group and time. We also get a single mean and median for the whole study: mean 384.5, median 344.5. ID and Group are now factor variables, which represents the categorical information in a format that can be used effectively in statistical analyses. Time is now an integer, which allows mathematical operations and analyses involving this variable.
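One way to check that the reshape lost no information is to pivot the long data back to wide and compare its dimensions with the original; a quick sketch using the objects created above:

```r
# Round trip: long -> wide should restore the original 16 x 13 layout
RATS_wide_again <- RATS_long %>%
  select(-Time) %>% # drop the derived integer column first
  pivot_wider(names_from = WD, values_from = weight)

dim(RATS_wide_again) # 16 13, matching dim(RATS)
```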

Model with a baseline as a covariate - dataset RATS

Implement the analyses of Chapter 8 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART I but using the RATS data (from Chapter 9 and Meet and Repeat: PART II). (0-7 points: 0-4 points for graphs or analysis results + 0-3 points for their interpretations)

Plot the data

#Access the packages
library(ggplot2)
library(tidyr)

# Draw the plot
ggplot(RATS_long, aes(x = Time, y = weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") 

We can see a large difference in weight between the groups from the beginning, and this difference persists over time, so the data points may be dependent on each other. Group 2 seems to contain one rat whose weight is very different from the others. Because of this large baseline difference, let’s standardize the data so that the starting point does not influence the results, and compare the groups with standardized data.

Analysis of the standardized rat weights per group and per Time

# Standardise the variable weight
RATS2 <- RATS_long %>%
  group_by(Time) %>% # perform the next steps separately for each time point
  mutate(stdweight = (weight - mean(weight)) / sd(weight)) %>% # standardize weight within each time point
  ungroup() # ungroup for the next use of this object

# Glimpse the data
glimpse(RATS2)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 15, 15, 15, …
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
# Draw the plot
ggplot(RATS2, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") 

Analysis of the mean and standard error of the rat weights per group and per Time.

# Summary data with mean and standard error of weight 
RATS3 <- RATS_long %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(weight),
            sd = sd(weight),
            n = n()) %>%
  mutate(se = sd / sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATS3)
## Rows: 33
## Columns: 6
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ sd    <dbl> 15.22158, 13.09307, 11.47591, 13.60081, 11.05748, 11.78377, 10.9…
## $ n     <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ se    <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87…
# Plot the mean per group of rats
ggplot(RATS3, aes(x = Time, y = mean, linetype = Group, shape = Group, color = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) + # set the 3 different lines for the 3 groups
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) + # set the 3 different point shapes for the 3 groups
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

This plot shows well the difference in weight between groups but also within groups. It also shows that the evolution over time depends on a rat’s starting weight.

Analysis of the mean weight per group

# Create a summary data by group and ID with the mean as the summary variable (ignoring the baseline, Time 1)
RATS4 <- RATS_long %>%
  filter(Time > 1) %>% # post first day. 
  group_by(Group, ID) %>%
  summarise( mean=mean(weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATS4)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# Draw a boxplot of the mean weight vs group
ggplot(RATS4, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight) for the studied Time period")

We can see an outlier in group 1 (a very small rat), an outlier in group 2 (a very big rat), and an outlier in group 3 (a rat smaller than the rest). We can also note the big differences between groups and between rats. It is possible to filter out the outliers, but this needs to be done carefully, especially if they represent natural variation within the data. Outliers can be legitimate data points that carry important information or represent unique cases (biological variability, for instance). However, if they are due to measurement errors or anomalies, it might be appropriate to handle them. In this case, let’s consider that the outliers in groups 1 and 3 result from a rare and unrepeatable event that substantially skews the observed results without reflecting the studied phenomenon’s reality.

Analysis of the mean weight per group after removing the outliers

# Create a new data set by filtering out the outliers, then adjust the ggplot code to draw the plot again with the new data
RATS4_2 <- RATS4 %>% 
  filter(mean > 240) %>% 
  filter(mean < 540)

ggplot(RATS4_2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight) for the studied Time period")

After removing the outliers, the distributions look more regular, making it easier to visualize trends and patterns in the data.

Let’s create a linear model

# Save the day 1 (WD1) weight of each rat as an object to use as the baseline in the model
RATS4_3 <- RATS4 %>%
  mutate(baseline = RATS$WD1) # relies on RATS4 having the same row order (by Group, then ID) as RATS

# Fit a linear model with the mean weight as the response and the baseline (WD1 weight) and Group as explanatory variables
# We do not model the rats individually; instead we let the group differences and the starting weight explain the results
fit_RATS <- lm(mean ~ baseline + Group, data = RATS4_3)

# summary of the model
summary(fit_RATS)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATS4_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.905  -4.194   2.190   7.577  14.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.16375   21.87657   1.516   0.1554    
## baseline     0.92513    0.08572  10.793 1.56e-07 ***
## Group2      34.85753   18.82308   1.852   0.0888 .  
## Group3      23.67526   23.25324   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.992 
## F-statistic: 622.1 on 3 and 12 DF,  p-value: 1.989e-13
anova(fit_RATS)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(fit_RATS) # 129
## [1] 129.4492
BIC(fit_RATS) # 133
## [1] 133.3121
plot(fit_RATS)

# As an experiment, let's remove the baseline and see what happens
# This model explains 92% of the variance, but only because the measurements within a group and within a rat are strongly dependent, so the model relies on a correlation that is not justified
fit_RATS2 <- lm(mean ~ Group, data = RATS4_3)
summary(fit_RATS2)
## 
## Call:
## lm(formula = mean ~ Group, data = RATS4_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.900 -27.169   2.325   9.069 106.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   265.02      12.98  20.419 2.92e-11 ***
## Group2        222.78      22.48   9.909 2.00e-07 ***
## Group3        262.48      22.48  11.675 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.71 on 13 degrees of freedom
## Multiple R-squared:  0.9316, Adjusted R-squared:  0.9211 
## F-statistic: 88.52 on 2 and 13 DF,  p-value: 2.679e-08
anova(fit_RATS2)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 238620  119310  88.525 2.679e-08 ***
## Residuals 13  17521    1348                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(fit_RATS2) # 165
## [1] 165.383
BIC(fit_RATS2) # 168
## [1] 168.4733

FIRST MODEL WITH BASELINE

- The t value of the baseline is 10.8, which is very high, meaning the baseline strongly impacts the studied weight (p is very small, so it is significant). The estimate for the baseline variable remains significantly different from zero even when considering its standard error, reinforcing its importance in explaining weight changes.

- Group Impact: Group 1 serves as the reference. The coefficients for Group2 and Group3, which represent the differences between those groups and the reference group, aren’t as substantial, and their impact on weight might not be statistically significant based on the p-values.

- The spread of residuals looks decent, with the majority relatively close to zero. However, there are a few larger residuals (especially the extreme values: the median is 2.19, while the interquartile range runs from -4.194 to 7.577), indicating areas where the model might not fit perfectly and where some fitted values are clearly under- or overestimated. This is also very visible in the residual plots.

- Similarly, the ANOVA results suggest that the baseline has a strong and significant effect on the mean. Group might have some influence: its p-value of 0.076 is only marginally above the usual 0.05 significance threshold.
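To complement the t-tests, confidence intervals for the coefficients make the uncertainty of the baseline and group effects explicit; a quick sketch using the model fitted above:

```r
# 95% confidence intervals for the fitted coefficients; an interval that
# excludes zero corresponds to a coefficient significant at the 5% level
confint(fit_RATS, level = 0.95)
```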

Random Effects in Longitudinal Analysis - BPRS dataset

Implement the analyses of Chapter 9 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART II, but using the BPRS data (from Chapter 8 and Meet and Repeat: PART I). (0-8 points: 0-4 points for graphs or analysis results + 0-4 points for their interpretations)

Plot the data

# Draw a plot for each subject and its bprs per week
ggplot(BPRS_long, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name = "bprs over time")

Insights: Variation between subjects is broad. Most subjects have a decreasing bprs over the weeks, regardless of the treatment. Time, and potentially the subject’s starting point, seems to affect the result more than the treatment type itself. Treatment 2 subjects seem to have a higher bprs from the beginning, and this seems to influence the rest of the weeks.

Creation of a linear model

# create a simple regression model
BPRS_model1 <- lm(bprs ~ week + treatment, data = BPRS_long) 

# print out a summary of the model
summary(BPRS_model1) 
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS_long)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
anova(BPRS_model1) 
## Analysis of Variance Table
## 
## Response: bprs
##            Df Sum Sq Mean Sq F value Pr(>F)    
## week        1  12372 12371.5 80.9143 <2e-16 ***
## treatment   1     29    29.5  0.1927 0.6609    
## Residuals 357  54584   152.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(BPRS_model1) # 2852
## [1] 2852.881
AIC(BPRS_model1) # 2837
## [1] 2837.337

Insights: the model explains only about 18% of the variance, which is very small; a large portion of the variability in ‘bprs’ remains unexplained by the predictors. Week has a large and significant effect on the bprs. The high residual variability indicates the model is not effective at explaining ‘bprs’, so it may be worth exploring other factors that could better capture its variability.

Adding random effects to the model:

Notes:

- Fixed Effects: variables whose effects (coefficients) we want to estimate directly; they have a systematic impact on the response variable. Fixed effects are often the primary focus of the analysis, while random effects account for additional variability that is not explicitly modeled as a primary predictor but contributes to the variance.

- Random Effects: variables we think affect the variability, but not in a systematic manner.

For example, in a study analyzing subjects’ ‘bprs’ under different treatments, the subjects themselves might be included as a random effect: we are not directly interested in the effect of each individual subject, but we acknowledge that they may introduce variability in ‘bprs’ that is not systematically related to the other predictors.
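To make the distinction concrete, here is a small simulated sketch (hypothetical data, not the course dataset; `lme4` assumed installed) where every subject shares one fixed time trend but starts from a random baseline level, and `lmer()` separates the between-subject spread from the residual noise:

```r
library(lme4)

set.seed(1)
subject  <- factor(rep(1:10, each = 8))     # 10 hypothetical subjects, 8 weeks each
week     <- rep(0:7, times = 10)
baseline <- rnorm(10, mean = 50, sd = 8)    # subject-specific starting level (SD 8)
score    <- baseline[as.integer(subject)] - 2 * week + rnorm(80, sd = 5)

# Random intercept per subject: the fixed part estimates the common trend,
# the random part absorbs the baseline differences.
m <- lmer(score ~ week + (1 | subject))
VarCorr(m)  # should recover roughly SD ~8 for subject and ~5 for the residual
```

The recovered standard deviations will not match the simulated ones exactly (only 10 subjects), but they illustrate what the “Random effects” block of the summaries below is estimating.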

library(lme4) # to add random effects
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(lmerTest) # to get p value with random effects
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
# Create a random intercept model with subject only; the effect of week is assumed to be the same for every subject.
BPRS_model2 <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_long, REML = FALSE)

# Summary of the model
summary(BPRS_model2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS_long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     1.9090  37.2392  24.334   <2e-16 ***
## week         -2.2704     0.2084 340.0000 -10.896   <2e-16 ***
## treatment2    0.5722     1.0761 340.0000   0.532    0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
anova(BPRS_model2) 
## Type III Analysis of Variance Table with Satterthwaite's method
##            Sum Sq Mean Sq NumDF DenDF  F value Pr(>F)    
## week      12371.5 12371.5     1   340 118.7136 <2e-16 ***
## treatment    29.5    29.5     1   340   0.2828 0.5952    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(BPRS_model2) # 2768
## [1] 2768.143
AIC(BPRS_model2) # 2748
## [1] 2748.712

The F value for “week” is much higher than the one for “treatment”, suggesting once again that “week” explains far more of the variability in the data than “treatment” does.

This model is a bit better than the previous one when considering the BIC and AIC. However, it does not take into account the fact that each subject’s ‘bprs’ may evolve differently over time, so let’s try a new model that includes this random effect.

Mixed model with random effects - week and subject

  • Individual Variability: If each subject indeed has different baseline levels (‘bprs’ at the start), a random intercept for ‘subject’ ((1 | subject)) might be appropriate.
  • Varying Response to Time: If time affects individuals differently, incorporating random slopes for ‘week’ within ‘subject’ ((week | subject)) might be beneficial.

# create a random intercept and random slope model
library(lme4)
BPRS_model3 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_long, REML = FALSE)

# summary of the model
summary(BPRS_model3)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS_long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     2.1052  22.6795  22.066  < 2e-16 ***
## week         -2.2704     0.2977  19.9991  -7.626 2.42e-07 ***
## treatment2    0.5722     1.0405 320.0005   0.550    0.583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# ANOVA test on the two models.
anova(BPRS_model2, BPRS_model3)
## Data: BPRS_long
## Models:
## BPRS_model2: bprs ~ week + treatment + (1 | subject)
## BPRS_model3: bprs ~ week + treatment + (week | subject)
##             npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_model2    5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_model3    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plot the residuals
plot(BPRS_model3)

In BPRS_model2: (1 | subject) represents a random intercept for each ‘subject’. The variance associated with this random intercept (subject-specific variability that is not explained by fixed effects) is estimated in this model.

In BPRS_model3: (week | subject) represents both random intercepts and random slopes for ‘week’ within each ‘subject’. This allows the ‘week’ effect to vary for each ‘subject’. This model seems better: the first plot shows that each subject has a very different ‘bprs’ slope, and this between-subject difference persists over time.
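The difference between the two random-effect structures can be inspected directly (a quick check, assuming the models fitted above): `ranef()` returns each subject’s estimated deviation from the fixed effects.

```r
# Subject-level deviations from the fixed effects:
# BPRS_model2 gives one column (intercept offsets);
# BPRS_model3 gives two (intercept and week-slope offsets), one row per subject.
head(ranef(BPRS_model2)$subject)
head(ranef(BPRS_model3)$subject)
```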

Comparison between the 2 models:

- BPRS_model2 has 5 estimated parameters, a higher AIC (2748.7) and a lower BIC (2768.1). Its BIC is lower because BIC penalizes more complex models more heavily (favoring the 5-parameter model over the 7-parameter one).

- BPRS_model3 has 7 estimated parameters and slightly lower AIC (2745.4) and a higher BIC (2772.6) compared to BPRS_model2.

- The chi-square test for the model comparison gives a p-value of 0.02636 for BPRS_model3, indicating that it significantly improves the model fit compared to BPRS_model2.

- BPRS_model3 has a higher log-likelihood (-1365.7) compared to BPRS_model2 (-1369.4). Therefore, based solely on the logLik values, BPRS_model3 appears to provide a better fit to the data than BPRS_model2.

- The lower deviance of BPRS_model3 implies that it explains a larger proportion of the total variance in the data than the other model.

To conclude, BPRS_model3 seems to provide a better fit for the data than BPRS_model2 due to its lower AIC and deviance, its higher logLik, and the significant improvement in fit indicated by the chi-square test.
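As a sanity check, the information criteria and the chi-square statistic reported above can be reproduced by hand from the log-likelihoods (a small base-R sketch; the input numbers are taken from the model summaries printed earlier):

```r
# Values from the summaries above: logLik, number of parameters, n = 360 obs.
ll2 <- -1369.356; npar2 <- 5   # BPRS_model2
ll3 <- -1365.720; npar3 <- 7   # BPRS_model3
n <- 360

aic <- function(ll, k)    -2 * ll + 2 * k         # AIC = -2*logLik + 2*npar
bic <- function(ll, k, n) -2 * ll + k * log(n)    # BIC = -2*logLik + npar*log(n)

aic(ll2, npar2)     # ~2748.7
bic(ll2, npar2, n)  # ~2768.1
aic(ll3, npar3)     # ~2745.4
bic(ll3, npar3, n)  # ~2772.6

# The likelihood-ratio (chi-square) statistic compared by anova():
-2 * (ll2 - ll3)    # ~7.27 on 2 df
```

This makes the BIC penalty explicit: each extra parameter costs log(360) ≈ 5.89 points of BIC but only 2 points of AIC, which is why the two criteria disagree about the 7-parameter model.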

Model3 analysis:

The model captures a significant effect of ‘week’ on ‘bprs’ scores, indicating a decrease in ‘bprs’ over time.

- The scaled residuals range from -2.8919 to 3.7976, a relatively wide spread, consistent with the heteroscedasticity visible in the residuals plot. The median is close to zero (-0.0691), implying that a substantial portion of the observations have residuals near zero. The negative extreme (-2.8919) is less extreme than the positive one (3.7976), suggesting that the model underestimates ‘bprs’ more often than it overestimates it (a positive residual, observed - predicted > 0, indicates an underestimation by the model).

- Random Intercept (subject-specific intercept): the estimated variance of the intercept among subjects is 64.8222, which translates to a standard deviation of about 8 points around the average bprs score (the fixed-effect intercept is 46.4539) when all other predictors (week, treatment) are zero. Subjects may therefore start with ‘bprs’ scores that differ by about 8 points on average from the overall intercept, roughly 17% variability relative to the intercept’s value. Compared to the mean of the observed scores, the relative standard deviation is roughly 21%.
This supports the notion that subject characteristics not accounted for by the other predictors may influence the initial ‘bprs’ scores, and that this influence could persist over time.

- The estimated variance of the effect of ‘week’ among subjects is 0.9609, with a standard deviation of 0.9802. This spread suggests that time noticeably affects the ‘bprs’ scores of some subjects while barely affecting others.

- Besides, there is a negative correlation (-0.51) between the random intercept and the random slope for ‘week’ within subjects. Since the average slope is negative, this means that subjects with a higher initial ‘bprs’ tend to have a more negative slope, i.e. a steeper decrease in ‘bprs’ over time, than subjects with lower initial scores.

- The “Correlation of Fixed Effects” block reports the correlations between the sampling errors of the coefficient estimates, not relationships between subjects. The -0.582 between the intercept and ‘week’ mainly reflects the study design (both coefficients are estimated from the same trajectories) and should not be over-interpreted substantively.

- Likewise, the -0.247 between the intercept and ‘treatment2’ estimates is a weaker sampling correlation of the same kind. To assess whether ‘treatment2’ changes the effect of time on ‘bprs’, the interaction between ‘week’ and ‘treatment’ needs to be modeled explicitly; any conclusion about one treatment attenuating the improvement over time would require that further analysis to be validated.
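The relative sizes discussed in the bullets above can be computed directly from the variance components reported in the model3 summary (a small base-R sketch using those numbers):

```r
# Put the random-effect spreads in perspective (numbers from the model3 summary)
sd_intercept <- sqrt(64.8222)   # between-subject SD of the starting level
sd_intercept                    # ~8.05 points
sd_intercept / 46.4539          # ~0.17: about 17% of the fixed intercept

sd_week <- sqrt(0.9609)         # between-subject SD of the weekly slope
sd_week / 2.2704                # ~0.43: the slope varies a lot across subjects
```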

Mixed model with random effects - week and subject, allowing the bprs change over time to depend on treatment

Note: ‘week’ and ‘treatment’ may interact. We use ‘*’ if there is reason to believe that the effect of week on bprs differs between treatments. If there is no reason to believe the relationship between week and bprs changes with treatment, the additive model week + treatment is sufficient.
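In R’s formula language, ‘*’ expands to the main effects plus their interaction, which can be verified with `terms()`:

```r
# 'week * treatment' is shorthand for the main effects plus their interaction
f <- bprs ~ week * treatment
attr(terms(f), "term.labels")
## [1] "week"           "treatment"      "week:treatment"
```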

# create a random intercept and random slope model with the interaction
BPRS_model4 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS_long, REML = FALSE)

# print a summary of the model
summary(BPRS_model4)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRS_long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)      47.8856     2.2521  29.6312  21.262  < 2e-16 ***
## week             -2.6283     0.3589  41.7201  -7.323 5.24e-09 ***
## treatment2       -2.2911     1.9090 319.9977  -1.200   0.2310    
## week:treatment2   0.7158     0.4010 319.9977   1.785   0.0752 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
plot(BPRS_model4)

# perform an ANOVA test on the 3 models
anova(BPRS_model2,BPRS_model3,BPRS_model4) 
## Data: BPRS_long
## Models:
## BPRS_model2: bprs ~ week + treatment + (1 | subject)
## BPRS_model3: bprs ~ week + treatment + (week | subject)
## BPRS_model4: bprs ~ week * treatment + (week | subject)
##             npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_model2    5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_model3    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## BPRS_model4    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The chi-square test for the new model gives a p-value of 0.075, which is above the 0.05 threshold: the interaction does not significantly improve the fit, so we cannot draw firm conclusions from it.

Analyzing the fitted vs observed values with model 3

# Create a vector of the fitted values of model 3 - our best model
Fitted_bprs <- fitted(BPRS_model3)

# Create a new column fitted to our BPRS_long dataset
BPRS_new <- BPRS_long %>% 
  mutate(Fitted_bprs = Fitted_bprs)

# draw the plot for the subjects, observed vs fitted values:
ggplot(BPRS_new, aes(x = week)) +
  geom_line(aes(y = bprs, group = subject, linetype = "Observed")) +
  geom_line(aes(y = Fitted_bprs, group = subject, linetype = "Fitted")) +
  facet_grid(. ~ treatment, labeller = label_both) +
  labs(x = "Week", y = "bprs") 

Note: the model does not seem very good at predicting the bprs of the subjects in each treatment group. The example we saw in the book with the RATS dataset fitted better.
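As a rough numeric complement to the visual comparison (assuming the `BPRS_new` data frame created above), the squared correlation between observed and fitted values gives a crude measure of how much of the variation the model captures:

```r
# Crude goodness-of-fit check: squared correlation of observed vs fitted values
cor(BPRS_new$bprs, BPRS_new$Fitted_bprs)^2
```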